feat(ingest/elasticsearch) Add support for api_key authentication #11473
@hsheth2 Please review the spec above and provide feedback on 1) the mixed types and 2) the scope of required tests. I can make a PR ASAP with the feat itself, but need more feedback on the tests.
The missing table in the generated docs is caused by incompatible jsonschema versions. The full generated jsonschema is pasted at the bottom of this post. You can put it into e.g. JSON Schema Validator and find that it validates OK for all draft versions up to 2019-09, but fails for draft 2020-12. The key part that's failing:

```json
{
"type": "object",
"properties": {
"api_key": {
"anyOf": [
{
"type": "array",
"minItems": 2,
"maxItems": 2,
"items": [
{
"type": "string"
},
{
"type": "string"
}
]
},
{
"type": "string"
}
]
}
}
}
```

Indeed, the tuple-form `items` (a list of schemas, one per position) is what fails: draft 2020-12 replaced this form with `prefixItems`, so newer validators reject the generated schema.
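To see the incompatibility concretely, here is a small sketch (my illustration, not DataHub code) using the Python `jsonschema` package (4.x), which ships validators for both drafts:

```python
from jsonschema import Draft7Validator, Draft202012Validator
from jsonschema.exceptions import SchemaError

# The failing fragment from above: tuple-form "items" with one schema per position.
schema = {
    "type": "object",
    "properties": {
        "api_key": {
            "anyOf": [
                {
                    "type": "array",
                    "minItems": 2,
                    "maxItems": 2,
                    # Valid up to draft 2019-09; draft 2020-12 renamed this
                    # positional form to "prefixItems".
                    "items": [{"type": "string"}, {"type": "string"}],
                },
                {"type": "string"},
            ]
        }
    },
}

Draft7Validator.check_schema(schema)  # passes: list-valued "items" is allowed
try:
    Draft202012Validator.check_schema(schema)  # here "items" must be a single schema
except SchemaError as err:
    print("draft 2020-12 rejects the schema:", err.message)
```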
In summary: I think DataHub generates JSON schemas compatible with the older drafts (up to 2019-09) only.

Full JSON Schema for elasticsearch with api_key union/tuple field:

```json
{
"title": "ElasticsearchSourceConfig",
"description": "Any source that connects to a platform should inherit this class",
"type": "object",
"properties": {
"env": {
"title": "Env",
"description": "The environment that all assets produced by this connector belong to",
"default": "PROD",
"type": "string"
},
"platform_instance": {
"title": "Platform Instance",
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://datahubproject.io/docs/platform-instances/ for more details.",
"type": "string"
},
"host": {
"title": "Host",
"description": "The elastic search host URI.",
"default": "localhost:9200",
"type": "string"
},
"username": {
"title": "Username",
"description": "The username credential.",
"type": "string"
},
"password": {
"title": "Password",
"description": "The password credential.",
"type": "string"
},
"api_key": {
"title": "Api Key",
"description": "API Key authentication. Accepts either a list with id and api_key (UTF-8 representation), or a base64 encoded string of id and api_key combined by ':'.",
"anyOf": [
{
"type": "array",
"minItems": 2,
"maxItems": 2,
"items": [
{
"type": "string"
},
{
"type": "string"
}
]
},
{
"type": "string"
}
]
},
"use_ssl": {
"title": "Use Ssl",
"description": "Whether to use SSL for the connection or not.",
"default": false,
"type": "boolean"
},
"verify_certs": {
"title": "Verify Certs",
"description": "Whether to verify SSL certificates.",
"default": false,
"type": "boolean"
},
"ca_certs": {
"title": "Ca Certs",
"description": "Path to a certificate authority (CA) certificate.",
"type": "string"
},
"client_cert": {
"title": "Client Cert",
"description": "Path to the file containing the private key and the certificate, or cert only if using client_key.",
"type": "string"
},
"client_key": {
"title": "Client Key",
"description": "Path to the file containing the private key if using separate cert and key files.",
"type": "string"
},
"ssl_assert_hostname": {
"title": "Ssl Assert Hostname",
"description": "Use hostname verification if not False.",
"default": false,
"type": "boolean"
},
"ssl_assert_fingerprint": {
"title": "Ssl Assert Fingerprint",
"description": "Verify the supplied certificate fingerprint if not None.",
"type": "string"
},
"url_prefix": {
"title": "Url Prefix",
"description": "There are cases where an enterprise would have multiple elastic search clusters. One way for them to manage is to have a single endpoint for all the elastic search clusters and use url_prefix for routing requests to different clusters.",
"default": "",
"type": "string"
},
"index_pattern": {
"title": "Index Pattern",
"description": "regex patterns for indexes to filter in ingestion.",
"default": {
"allow": [
".*"
],
"deny": [
"^_.*",
"^ilm-history.*"
],
"ignoreCase": true
},
"allOf": [
{
"$ref": "#/definitions/AllowDenyPattern"
}
]
},
"ingest_index_templates": {
"title": "Ingest Index Templates",
"description": "Ingests ES index templates if enabled.",
"default": false,
"type": "boolean"
},
"index_template_pattern": {
"title": "Index Template Pattern",
"description": "The regex patterns for filtering index templates to ingest.",
"default": {
"allow": [
".*"
],
"deny": [
"^_.*"
],
"ignoreCase": true
},
"allOf": [
{
"$ref": "#/definitions/AllowDenyPattern"
}
]
},
"profiling": {
"title": "Profiling",
"description": "Configs to ingest data profiles from ElasticSearch.",
"allOf": [
{
"$ref": "#/definitions/ElasticProfiling"
}
]
},
"collapse_urns": {
"title": "Collapse Urns",
"description": "List of regex patterns to remove from the name of the URN. All of the indices before removal of URNs are considered as the same dataset. These are applied in order for each URN.\n The main case where you would want to have multiple of these if the name where you are trying to remove suffix from have different formats.\n e.g. ending with -YYYY-MM-DD as well as ending -epochtime would require you to have 2 regex patterns to remove the suffixes across all URNs.",
"allOf": [
{
"$ref": "#/definitions/CollapseUrns"
}
]
}
},
"additionalProperties": false,
"definitions": {
"AllowDenyPattern": {
"title": "AllowDenyPattern",
"description": "A class to store allow deny regexes",
"type": "object",
"properties": {
"allow": {
"title": "Allow",
"description": "List of regex patterns to include in ingestion",
"default": [
".*"
],
"type": "array",
"items": {
"type": "string"
}
},
"deny": {
"title": "Deny",
"description": "List of regex patterns to exclude from ingestion.",
"default": [],
"type": "array",
"items": {
"type": "string"
}
},
"ignoreCase": {
"title": "Ignorecase",
"description": "Whether to ignore case sensitivity during pattern matching.",
"default": true,
"type": "boolean"
}
},
"additionalProperties": false
},
"OperationConfig": {
"title": "OperationConfig",
"type": "object",
"properties": {
"lower_freq_profile_enabled": {
"title": "Lower Freq Profile Enabled",
"description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
"default": false,
"type": "boolean"
},
"profile_day_of_week": {
"title": "Profile Day Of Week",
"description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.",
"type": "integer"
},
"profile_date_of_month": {
"title": "Profile Date Of Month",
"description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.",
"type": "integer"
}
},
"additionalProperties": false
},
"ElasticProfiling": {
"title": "ElasticProfiling",
"type": "object",
"properties": {
"enabled": {
"title": "Enabled",
"description": "Whether to enable profiling for the elastic search source.",
"default": false,
"type": "boolean"
},
"operation_config": {
"title": "Operation Config",
"description": "Experimental feature. To specify operation configs.",
"allOf": [
{
"$ref": "#/definitions/OperationConfig"
}
]
}
},
"additionalProperties": false
},
"CollapseUrns": {
"title": "CollapseUrns",
"type": "object",
"properties": {
"urns_suffix_regex": {
"title": "Urns Suffix Regex",
"description": "List of regex patterns to remove from the name of the URN. All of the indices before removal of URNs are considered as the same dataset. These are applied in order for each URN.\n The main case where you would want to have multiple of these if the name where you are trying to remove suffix from have different formats.\n e.g. ending with -YYYY-MM-DD as well as ending -epochtime would require you to have 2 regex patterns to remove the suffixes across all URNs.",
"type": "array",
"items": {
"type": "string"
}
}
},
"additionalProperties": false
}
}
}
```
@hsheth2 Thanks for taking in this change 🥳
Is your feature request related to a problem? Please describe.
Add the ability to use API key authentication for Elasticsearch sources, as an alternative to `username`/`password` and certificate-based auth. Proposing to add an optional `api_key` (string | tuple) config parameter.

Describe the solution you'd like
Elasticsearch accepts API key authentication via HTTP headers in the format `Authorization: ApiKey [ENCODED]`. The `[ENCODED]` value is provided by Elasticsearch during API key creation (internally it's a base64 encoding of `[ID]:[API_KEY]`).
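For illustration, a minimal sketch of how the `[ENCODED]` value relates to `[ID]` and `[API_KEY]` (the key values are made up):

```python
import base64

# Hypothetical id/secret pair as returned by Elasticsearch on API key creation.
key_id = "VuaCfGcBCdbkQm-e5aOx"
api_key = "ui2lp2axTNmsyakw9tvNnw"

# [ENCODED] is base64("[ID]:[API_KEY]"), sent as "Authorization: ApiKey <encoded>".
encoded = base64.b64encode(f"{key_id}:{api_key}".encode("utf-8")).decode("ascii")
print(encoded)
```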
The DataHub Elasticsearch source is implemented via elasticsearch-py 7.x (currently pinned to `elasticsearch==7.13.4`). It passes any auth kwargs via Transport onto the default connection class Urllib3HttpConnection, which accepts: "api_key – optional API Key authentication as either base64 encoded string or a tuple."
Note that the latest elasticsearch-py (8.15.1) takes `api_key (str | Tuple[str, str] | None)` directly, which is compatible. I checked in the source (`_get_api_key_header_val`) what they actually mean by "string or tuple", and it corresponds to the web docs above: `api_key` requires either str `[ENCODED]` or tuple `([ID], [API_KEY])`.
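For example, both forms in elasticsearch-py 7.x would look roughly like this (host and key values are placeholders):

```python
from elasticsearch import Elasticsearch

# Form 1: pass the base64-encoded [ENCODED] string; it is used verbatim in
# the "Authorization: ApiKey ..." header.
es = Elasticsearch(
    hosts=["localhost:9200"],
    api_key="VnVhQ2ZHY0JDZGJrUW0tZTVhT3g6dWkybHAyYXhUTm1zeWFrdzl0dk5udw==",
)

# Form 2: pass an ([ID], [API_KEY]) tuple; the client base64-encodes
# "id:api_key" itself before building the header.
es = Elasticsearch(
    hosts=["localhost:9200"],
    api_key=("VuaCfGcBCdbkQm-e5aOx", "ui2lp2axTNmsyakw9tvNnw"),
)
```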
I propose the following DataHub config format: `api_key` (str | array), optional. Since YAML doesn't have tuples, we need to accept a list and convert it to a Python tuple; see the sketch below.

From the looks of jsonschema and pydantic it should be possible to have a mixed type. Not sure if the DataHub web documentation handles mixed types (anyone?). If that turns out not to be possible, we can fall back to the encoded str only.
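A minimal sketch of the list-to-tuple coercion, assuming pydantic v1 (the class and validator names are illustrative, not the actual DataHub implementation):

```python
from typing import Optional, Tuple, Union

from pydantic import BaseModel, validator

class ElasticsearchAuthConfig(BaseModel):
    # Either the base64-encoded [ENCODED] string, or an ([ID], [API_KEY]) pair.
    api_key: Optional[Union[Tuple[str, str], str]] = None

    @validator("api_key", pre=True)
    def _list_to_tuple(cls, v):
        # YAML has no tuple literal, so recipes supply a 2-item list; convert it
        # so the value can be passed straight to Elasticsearch(api_key=...).
        if isinstance(v, list):
            if len(v) != 2:
                raise ValueError("api_key must be [id, api_key] when given as a list")
            return tuple(v)
        return v
```

In a recipe this would then accept either `api_key: "bXktZW5jb2RlZC1rZXk="` or `api_key: ["my-id", "my-secret"]`.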
Describe alternatives you've considered
NONE
Additional context
Tests: DataHub does not currently have integration tests for Elasticsearch, and e.g. basic http_auth is also not tested today. Besides setting up Elasticsearch with docker-compose, testing `api_key` requires first creating an API key against the running cluster.

Creating tests is a much larger task than the feat itself, so the scope needs clarification.
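As a rough sketch of what such a test could look like (the endpoint, bootstrap credentials, and function names are assumptions, not existing DataHub test code), assuming a docker-compose Elasticsearch with security enabled:

```python
import base64

from elasticsearch import Elasticsearch

def create_encoded_api_key(host="http://localhost:9200"):
    # Bootstrap client with basic auth, used only to mint a test API key.
    admin = Elasticsearch(hosts=[host], http_auth=("elastic", "changeme"))
    resp = admin.security.create_api_key(body={"name": "datahub-ingest-test"})
    # Build the [ENCODED] form from the returned id and api_key.
    return base64.b64encode(f"{resp['id']}:{resp['api_key']}".encode()).decode()

def test_api_key_auth():
    encoded = create_encoded_api_key()
    es = Elasticsearch(hosts=["http://localhost:9200"], api_key=encoded)
    assert es.ping()
```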