
Increasing of ignore_above for keyword #105

Closed
vbohata opened this issue Aug 23, 2018 · 13 comments · Fixed by #2424

vbohata commented Aug 23, 2018

For some keyword fields, like url.query, the default ignore_above of 1024 is not long enough. So either the default ignore_above should be increased (8192 seems to be safe for Lucene), or the length should be increased for those specific fields.
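For reference, raising the limit for one field is just a matter of overriding ignore_above in the index template's mapping. A minimal sketch against a 6.x cluster (the template name and index pattern here are illustrative, not anything ECS ships):

```json
PUT _template/ecs-ignore-above-override
{
  "index_patterns": ["logs-*"],
  "mappings": {
    "_doc": {
      "properties": {
        "url": {
          "properties": {
            "query": { "type": "keyword", "ignore_above": 8192 }
          }
        }
      }
    }
  }
}
```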

ruflin commented Aug 24, 2018

The current 1024 comes from Beats. I was about to comment that so far it rarely caused any problems except in some edge cases. And just a few moments ago one of these edge cases popped up: elastic/beats#8076

Another option would be to not have a default, and only set it on some specific fields for ECS?

@webmat webmat mentioned this issue Sep 18, 2018
webmat commented Sep 18, 2018

Hmmm, good point. The funny thing is I always assumed that ignore_above truncated the value added to the keyword index (but not in the _source, obviously) when it went above the limit.

I just double-checked with the following, and confirmed that I was wrong:

PUT ignore
{
  "mappings": {
    "_doc": {
      "properties": {
        "txt": {
          "type": "keyword",
          "ignore_above": 8
        }
      }
    }
  }
}

PUT ignore/_doc/1
{ "txt": "short" }

PUT ignore/_doc/2
{ "txt": "quite a bit too long" }

GET ignore/_search
{
  "aggs": {
    "texts": {
      "terms": { "field": "txt" }
    }
  }
}

The aggregation returns only one bucket (but still reports hits.total = 2, funny enough):

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "ignore",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "_source": {
          "txt": "quite a bit too long"
        }
      },
      {
        "_index": "ignore",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "txt": "short"
        }
      }
    ]
  },
  "aggregations": {
    "texts": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "short",
          "doc_count": 1
        }
      ]
    }
  }
}

webmat commented Sep 18, 2018

So in short, I agree we need to increase ignore_above here at the very least. Perhaps in other places as well.

Perhaps, as Nic is saying, we should even remove it... This one requires more thought. I would definitely keep it on any field that could be a vector of attack (e.g. an attacker crafting an extremely long URL). It's potentially OK to remove on any field that is filled by internal systems, as opposed to user-generated values.

ruflin commented Sep 24, 2018

What can be used as an attack vector seems to me outside the scope of ECS and more specific to each use case / implementation. That is why I think removing it would still be a good option.

webmat commented Sep 24, 2018

Changing the value for ignore_above over time will not cause a breaking change. So if we want to go with removing it altogether for now, I can go with that. It can be added again later, if needed, without issue.

ruflin commented Sep 25, 2018

Let's do that. The main challenge here is it requires a change on the Beats code side :-(

vbohata commented Sep 25, 2018

Be careful: simply removing it can cause problems with multi-fields. In that case, the keyword sub-field should keep ignore_above set to the maximum safe value.
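The multi-field case mentioned above looks roughly like this (field names are illustrative): a text field with a keyword sub-field, where ignore_above on the sub-field caps what reaches the keyword index while the analyzed text field and _source keep the full value. The 8192 here follows the "safe for Lucene" value suggested earlier in the thread:

```json
PUT multifield-example
{
  "mappings": {
    "_doc": {
      "properties": {
        "message": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 8192 }
          }
        }
      }
    }
  }
}
```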

ruflin commented Sep 28, 2018

My thinking here is that it would be the operator's responsibility to set ignore_above on the fields they think should have one.

webmat commented Sep 28, 2018

I agree in principle. I'd like to see how this is done in practice, however.

@ruflin ruflin mentioned this issue Oct 31, 2018
neu5ron commented Mar 20, 2019

Great discussion! I think this is often missed in security logging use cases!
Just some more fuel for moving forward on this: I see it's bumped to 1024, but I am wondering if we can make an exception and go to at least 4096?
This is something we tackled in production many moons ago; we did two things:

  1. increased ignore_above to somewhere around 7,000+
  2. calculated the length of any fields where this scenario could happen, which in HTTP is about every header (client and server), dns answer, dns query, command line, process, file names, user agents (http, sip, smtp), many SSL fields, registry keys, etc.

Here is one in the wild for URIs:
[screenshot: github-bro-http-uri-length]

webmat commented Mar 21, 2019

Yes, this is definitely something we have in mind to fix for ECS 1.1. We'll be going over many things and cleaning up the details :-)

Also, thanks for that screenshot. Are the top 3 values legit URIs, or an attempt at a buffer overflow? 😱 😆

Note that the meaning of ignore_above is to stop indexing a field beyond a given length. However, the full original value would still be present in the event's _source. It just wouldn't turn up in a search on that keyword field:

  • A filter on that exact value will fail to return that document
  • A prefix search (e.g. for autocomplete) wouldn't consider this value
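Both points can be reproduced against the "ignore" index created earlier in the thread (ignore_above: 8). A term query on the too-long value should find nothing, because the value was never indexed, even though the document itself is still there and its _source is intact:

```json
GET ignore/_search
{
  "query": {
    "term": { "txt": "quite a bit too long" }
  }
}
```

This should return zero hits, while the same query with "txt": "short" matches document 1; a prefix query such as { "prefix": { "txt": "quite" } } would likewise miss document 2.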

neu5ron commented Mar 21, 2019 via email

webmat commented Mar 21, 2019

Yes, good point. Although who has room to display a 500k long value in an aggregation? :trollface:
