Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest community_id processor deviates from the community-id-spec #105247

Closed
joegallo opened this issue Feb 7, 2024 · 6 comments
Closed

Ingest community_id processor deviates from the community-id-spec #105247

joegallo opened this issue Feb 7, 2024 · 6 comments
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team

Comments

@joegallo
Copy link
Contributor

joegallo commented Feb 7, 2024

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "community_id": {
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "source": {
          "ip": "1.2.3.4",
          "port": 1122
        },
        "destination": {
          "ip": "5.6.7.8",
          "port": 3344
        },
        "network": {
          "transport": "TCP"
        }
      }
    }
  ]
}

Results in 1:wCb3OG7yAFWelaUydu0D+125CLM=, which matches the first row in the table 'Default settings' table.

Similarly:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "community_id": {
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "source": {
          "ip": "5.6.7.8",
          "port": 0
        },
        "destination": {
          "ip": "1.2.3.4",
          "port": 0
        },
        "network": {
          "iana_number": 1
        }
      }
    }
  ]
}

Results in 1:crodRHL2FEsHjbv3UkRrfbs4bZ0= in the middle of the table.

So we're all good on those.

However, changing the iana_number to 17 on that last example gives the following result:

{
  "docs" : [
    {
      "error" : {
        "root_cause" : [
          {
            "type" : "illegal_argument_exception",
            "reason" : "invalid source port [0]"
          }
        ],
        "type" : "illegal_argument_exception",
        "reason" : "invalid source port [0]"
      }
    }
  ]
}

Placeholder editorial comment: I'm not absolutely certain as I write this ticket right now whether it's correct for us to do this kind of validation at this point or not -- is it up to us at this stage to tell users that the data is bad (is it even bad?) or should we just process it?

Another case where we deviate from the documented spec is captured in #105247 (comment). I haven't debugged the issue there well enough to say why our value is different than the specified value, however.

My recollection is that the Elasticsearch implementation is a port of the beats implementation, so if we're wrong in these ways then they're probably wrong in these ways, too. When we figure out the ways in which we're wrong, we should make sure to raise that with them, too. At the very least it makes sense to add a few more cases to their test suite to confirm that they handle these cases correctly, but if they have the same issues we do then those issues should also be fixed so that our behaviors stay in sync. (See #55685 / #66534.)


An annoyance here is that fixing the broken behavior is probably a breaking change -- if you've been relying on the community ids that we've been providing, and those values have been consistently wrong, then when we fix the behavior you'll get new different values that while correct are now different than what you'd seen before. 🤷

@joegallo joegallo added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team labels Feb 7, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@joegallo joegallo changed the title Ingest community_id processor possibly deviates from the community-id-spec Ingest community_id processor deviates from the community-id-spec Feb 7, 2024
@joegallo
Copy link
Contributor Author

joegallo commented Feb 7, 2024

Here's a case where we deviate from the spec:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "community_id": {
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "source": {
          "ip": "1.2.3.4",
          "port": 11
        },
        "destination": {
          "ip": "5.6.7.8",
          "port": 0
        },
        "network": {
          "iana_number": 1
        }
      }
    }
  ]
}

This results in 1:1vNr+cVw7bzZ+tGsh2H7voBwaaY=, but according to the spec the value should be 1:f/YiSyWqczrTgfUCZlBUnvHRcPk=.

So there's a clear case of some issue in our tests where we're not actually covering the specified behavior correctly. We should expand the tests to cover more of the community-id-spec, then fix our behavior to adhere to it.

@joegallo
Copy link
Contributor Author

joegallo commented Feb 7, 2024

Ah, here's something.

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "community_id": {
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "source": {
          "ip": "5.6.7.8",
          "port": 0
        },
        "destination": {
          "ip": "1.2.3.4",
          "port": 0
        },
        "network": {
          "iana_number": 1
        }
      }
    }
  ]
}

Correctly yields 1:crodRHL2FEsHjbv3UkRrfbs4bZ0=.

But:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "community_id": {
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "source": {
          "ip": "1.2.3.4",
          "port": 8
        },
        "destination": {
          "ip": "5.6.7.8",
          "port": 0
        },
        "network": {
          "iana_number": 1
        }
      }
    }
  ]
}

Should also yield 1:crodRHL2FEsHjbv3UkRrfbs4bZ0= and instead yields 1:1vNr+cVw7bzZ+tGsh2H7voBwaaY=, so I suspect the issue is that we're not implementing the ICMP request/response pseudo port logic correctly. (But 🤷.)

@limotova
Copy link
Contributor

Hi, I took a look into this, I believe the community ID implementation is generally correct.

Addressing the first issue:
The third example is a UDP request (iana_number 17) with 0 and 0 as ports, which is invalid. The sample I believe you were referencing in the community id spec (on the 4th line) uses ports 3344 and 1122. When I run the following I get the correct ID

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "community_id": {
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "source": {
          "ip": "5.6.7.8",
          "port": 3344
        },
        "destination": {
          "ip": "1.2.3.4",
          "port": 1122
        },
        "network": {
          "iana_number": 17
        }
      }
    }
  ]
}

Results in 1:0Mu9InQx6z4ZiCZM/7HXi2WMhOg=

For the second and third issues I believe the cause of this is that when the protocol is ICMP/ICMP6, the Processor expects a separate icmp.type and icmp.code field (defaulting both to 0), and ignores port altogether.
I think this is reasonable behavior because the community-id specs distinguishes between ports and ICMP type/code (and in their high-level functions as well), but the baseline data drops the distinction, and internally they treat type/code like a port, which is where the confusion came from I think

When I move source.port to icmp.type, and destination.port to icmp.code I get the expected community IDs:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "community_id": {
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "source": {
          "ip": "1.2.3.4"
        },
        "destination": {
          "ip": "5.6.7.8"
        },
        "network": {
          "iana_number": 1
        },
        "icmp": {
          "type": 11,
          "code": 0
        }
      }
    }
  ]
}

Results in: 1:f/YiSyWqczrTgfUCZlBUnvHRcPk=

And

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "community_id": {
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "source": {
          "ip": "1.2.3.4"
        },
        "destination": {
          "ip": "5.6.7.8"
        },
        "network": {
          "iana_number": 1
        },
        "icmp": {
          "type": 8,
          "code": 0
        }
      }
    }
  ]
}

Gives me 1:crodRHL2FEsHjbv3UkRrfbs4bZ0=

I think whether we also want to accept ports as icmp types/codes is something that warrants discussion, but I think the current behavior is also fairly reasonable.

@joegallo
Copy link
Contributor Author

Ahhhhhhhhhhhhhhhhhhhhh... I see it now. (Placeholder: actual thoughts to follow after I eat something.)

@joegallo
Copy link
Contributor Author

Okay, I was ambushed by a feral weekend, but I'm back.

I believe the cause of this is that when the protocol is ICMP/ICMP6, the Processor expects a separate icmp.type and icmp.code field (defaulting both to 0), and ignores port altogether.

Yes, precisely, you're quite right.

I think there's a docs enhancement to put through that'll tie this up rather nicely -- here are two documents representing an ICMP timestamp request and response, both have a community id of "1:9BH16h8N+en0tEsYCCxMA2p5hLo=".

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "community_id": {
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "source": {
          "ip": "1.2.3.4"
        },
        "destination": {
          "ip": "5.6.7.8"
        },
        "network": {
          "transport": "ICMP"
        },
        "icmp": {
          "type": 13,
          "code": 0
        }
      }
    }
  ]
}

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "community_id": {
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "source": {
          "ip": "5.6.7.8"
        },
        "destination": {
          "ip": "1.2.3.4"
        },
        "network": {
          "transport": "ICMP"
        },
        "icmp": {
          "type": 14,
          "code": 0
        }
      }
    }
  ]
}

We've gotten a couple of questions from users related to using "port 0" but I suspect that those users were making a mistake along the same lines as the one I was was making. Given that, I think we can close this ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team
Projects
None yet
Development

No branches or pull requests

3 participants