Features
Anthony Virtuoso edited this page Nov 16, 2019 · 2 revisions
For a complete and up-to-date list, see the Athena Query Federation SDK README.md. A summary of the most prominent features follows.
- Federated Metadata - It is not always practical to centralize table metadata in a single meta-store. As such, this SDK allows Athena to delegate portions of its query planning to your connector in order to retrieve metadata about your data source.
- Federated Scalar Batch Functions - If you have a transformation that can be applied to a row to produce a new column value, you can use Athena Query Federation to add your own custom User Defined Function. Common examples include decrypting an encrypted column and masking the resulting value, or resolving an IP address into a geolocation.
- Glue DataCatalog Support - You can optionally enable a pre-built Glue MetadataHandler in your connector, which will first attempt to fetch metadata from Glue about any table being queried before giving you an opportunity to modify or rewrite the retrieved metadata. This can be handy when you are using a custom format in S3 or if your data source doesn't have its own source of metadata (e.g. Redis).
- AWS Secrets Manager Integration - If your connectors need passwords or other sensitive information, you can optionally use the SDK's built-in tooling to resolve secrets. For example, if you have a config with a JDBC connection string, you can write "jdbc://${username}:${password}@hostname:port?options" and the SDK will automatically replace ${username} and ${password} with AWS Secrets Manager secrets of the same name.
- Federated Identity - When Athena federates a query to your connector, you may want to perform authorization (AuthZ) based on the identity of the entity that executed the Athena query.
- Partition Pruning - Athena will call your connector to understand how the table being queried is partitioned, as well as to obtain which partitions need to be read for a given query. If your source supports partitioning, this gives you an opportunity to use the query predicate to perform partition pruning.
- Parallelized & Pipelined Reads - Athena will parallelize reading your tables based on the partitioning information you provide. You also have the opportunity to tell Athena how (and if) it should split each partition into multiple (potentially concurrent) read operations. Behind the scenes, Athena will parallelize reading the splits (work units) you've created and pipeline reads to reduce the performance impact of reading a remote source.
- Predicate Pushdown - (Associative Predicates) Where relevant, Athena will supply you with the associative portion of the query predicate so that you can perform filtering or push the predicate into your source system for even better performance. Note that this is not always the query's full predicate. For example, if the query's predicate is "where (col0 < 1 or col1 < 10) and col2 + 10 < 100", only "col0 < 1 or col1 < 10" will be supplied to you at this time. We are still considering the best form for supplying connectors with a more complete view of the query and its predicate, and expect a future release to provide this to connectors that are capable of utilizing it.
- Column Projection - Where relevant, Athena will supply you with the columns that need to be projected so that you can reduce data scanned.
- Limited Scans - While Athena is not yet able to push down limits to your connector, the SDK does expose a mechanism by which you can abandon a scan early. Athena will already avoid scanning partitions and splits that are not needed once a limit, failure, or user cancellation occurs, but this functionality allows connectors that are in the middle of processing a split to stop regardless of the cause. This works even when the query's limit cannot be semantically pushed down (e.g. the limit happens after a filtered join). In a future release we may also introduce traditional limit pushdown for the simple cases that would support it.
- Congestion Control - Some of the sources you may wish to federate to may not be as scalable as Athena, or may be running performance-sensitive workloads that you wish to protect from an overzealous federated query. Athena will automatically detect congestion by listening for FederationThrottleException(s) as well as many other AWS service exceptions that indicate your source is overwhelmed. When Athena detects congestion, it reduces parallelism against your source. Within the SDK you can make use of ThrottlingInvoker to more tightly control congestion yourself. Lastly, you can reduce the concurrency your Lambda functions are allowed to achieve in the Lambda console, and Athena will respect that setting.
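
To make the AWS Secrets Manager Integration item more concrete, here is a minimal, language-neutral sketch of the `${name}` substitution it describes. It is written in Python for brevity and is not the SDK's own (Java) implementation. The function `resolve_secrets`, the placeholder pattern, and the in-memory `fake_store` are assumptions made for illustration only.

```python
import re
from typing import Callable

# Assumed placeholder syntax: ${secret_name}
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve_secrets(template: str, get_secret: Callable[[str], str]) -> str:
    """Replace every ${name} in `template` with the value get_secret(name) returns."""
    return PLACEHOLDER.sub(lambda match: get_secret(match.group(1)), template)


# Hypothetical stand-in for a secrets store, so the sketch runs without AWS access.
fake_store = {"username": "athena_user", "password": "s3cr3t"}

resolved = resolve_secrets(
    "jdbc://${username}:${password}@hostname:port?options",
    lambda name: fake_store[name],
)
print(resolved)
```

In a real deployment the `get_secret` callable would look the name up in AWS Secrets Manager instead of a dictionary; the substitution logic itself would stay the same.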
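
The Predicate Pushdown item can be illustrated with the example predicate from that bullet. The sketch below is conceptual Python, not the SDK's API: `connector_scan` and `engine_query` are hypothetical names, and the rows are made-up sample data. It shows the division of labor, where the connector filters on the supplied portion (`col0 < 1 or col1 < 10`) and the engine applies the remainder (`col2 + 10 < 100`) afterwards.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]
Predicate = Callable[[Row], bool]


def connector_scan(rows: Iterable[Row], pushed: Predicate) -> Iterator[Row]:
    """Connector side: return only rows that satisfy the pushed-down predicate."""
    for row in rows:
        if pushed(row):
            yield row


def engine_query(rows: Iterable[Row], pushed: Predicate, residual: Predicate) -> List[Row]:
    """Engine side: apply the part of the predicate that was not pushed down."""
    return [row for row in connector_scan(rows, pushed) if residual(row)]


# Made-up sample data for illustration.
source_rows = [
    {"col0": 0, "col1": 50, "col2": 5},   # passes both parts
    {"col0": 5, "col1": 3, "col2": 95},   # passes pushed part, fails residual
    {"col0": 5, "col1": 50, "col2": 5},   # filtered out by the connector
]

pushed = lambda r: r["col0"] < 1 or r["col1"] < 10   # supplied to the connector
residual = lambda r: r["col2"] + 10 < 100            # evaluated by the engine

returned_by_connector = list(connector_scan(source_rows, pushed))
result = engine_query(source_rows, pushed, residual)
print(len(returned_by_connector), result)
```

Because the engine still evaluates the remaining predicate, a connector that filters only on what it was given (or does not filter at all) produces the same final result; pushing the filter down simply reduces how much data leaves the source.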
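
For the Congestion Control item, the following sketch shows one common way a caller can react to throttling signals: wait longer after each throttle and relax again after a success. It is an assumption-laden illustration in Python and does not reproduce the SDK's ThrottlingInvoker or Athena's internal behavior. `ThrottleError`, `BackoffInvoker`, and all default values are hypothetical.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class ThrottleError(Exception):
    """Hypothetical signal that the source is overwhelmed."""


class BackoffInvoker:
    """Retries a call, doubling the delay on throttling and halving it on success."""

    def __init__(self, initial_delay_s: float = 0.01, max_delay_s: float = 1.0,
                 max_attempts: int = 5,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.initial_delay_s = initial_delay_s
        self.max_delay_s = max_delay_s
        self.max_attempts = max_attempts
        self.sleep = sleep
        self.delay_s = 0.0  # no delay until congestion is observed

    def invoke(self, call: Callable[[], T]) -> T:
        for attempt in range(1, self.max_attempts + 1):
            if self.delay_s > 0:
                self.sleep(self.delay_s)
            try:
                result = call()
            except ThrottleError:
                if attempt == self.max_attempts:
                    raise  # give up and surface the congestion to the caller
                # Congestion: double the delay, bounded by max_delay_s.
                self.delay_s = min(self.max_delay_s,
                                   max(self.initial_delay_s, self.delay_s * 2))
                continue
            # Success: halve the delay, or drop it once it reaches the initial value.
            self.delay_s = self.delay_s / 2 if self.delay_s > self.initial_delay_s else 0.0
            return result
        raise ValueError("max_attempts must be at least 1")


# Usage: a made-up source that throttles twice before answering.
sleeps: List[float] = []
outcomes = iter([ThrottleError(), ThrottleError(), "ok"])


def flaky_call() -> str:
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


invoker = BackoffInvoker(sleep=sleeps.append)  # record delays instead of sleeping
answer = invoker.invoke(flaky_call)
print(answer, sleeps)
```

Injecting `sleep` keeps the sketch testable without real waiting. A production connector would more likely rely on the SDK's own tooling and on Lambda concurrency limits, as the bullet describes.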