feat(glue): support partition index on tables #17998

Merged
merged 11 commits into from
Dec 29, 2021
Changes from 10 commits
48 changes: 47 additions & 1 deletion packages/@aws-cdk/aws-glue/README.md
@@ -194,7 +194,7 @@ new glue.Table(this, 'MyTable', {

By default, an S3 bucket is created to store the table's data, and the data is stored at the bucket root. You can also pass the `bucket` and `s3Prefix` manually:

-### Partitions
+### Partition Keys

To improve query performance, a table can specify `partitionKeys` on which data is stored and queried separately. For example, you might partition a table by `year` and `month` to optimize queries based on a time window:

@@ -218,6 +218,52 @@ new glue.Table(this, 'MyTable', {
});
```

### Partition Indexes

Another way to improve query performance is to specify partition indexes. If no partition indexes are
present on the table, AWS Glue loads all partitions of the table and filters the loaded partitions using
the query expression. The query takes more time to run as the number of partitions increases. With an
index, the query will try to fetch a subset of the partitions instead of loading all partitions of the
table.
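
As a rough illustration of the difference, here is a toy in-memory model. It is not how AWS Glue is implemented; the `PartitionEntry` shape, the example bucket path, and both helper functions are made up for this sketch. It compares filtering every partition with looking up a pre-grouped subset:

```ts
// Toy model only: none of these names come from AWS Glue or the CDK.
interface PartitionEntry {
  year: number;
  month: number;
  location: string;
}

// Three years of monthly partitions: 36 entries in total.
const allPartitions: PartitionEntry[] = [];
for (let year = 2019; year <= 2021; year++) {
  for (let month = 1; month <= 12; month++) {
    allPartitions.push({ year, month, location: `s3://example-bucket/year=${year}/month=${month}/` });
  }
}

// Without an index: every partition is loaded, then filtered.
function scanWithoutIndex(wantedYear: number): { loaded: number; matches: PartitionEntry[] } {
  let loaded = 0;
  const matches = allPartitions.filter((entry) => {
    loaded++;
    return entry.year === wantedYear;
  });
  return { loaded, matches };
}

// With an index on `year`: partitions are grouped up front, so a query
// only touches the group it asks for.
const partitionsByYear: Record<number, PartitionEntry[]> = {};
for (const entry of allPartitions) {
  const group = partitionsByYear[entry.year] || [];
  group.push(entry);
  partitionsByYear[entry.year] = group;
}

function lookupWithIndex(wantedYear: number): { loaded: number; matches: PartitionEntry[] } {
  const matches = partitionsByYear[wantedYear] || [];
  return { loaded: matches.length, matches };
}

const withoutIndex = scanWithoutIndex(2021);
const withIndex = lookupWithIndex(2021);
console.log(withoutIndex.loaded); // 36
console.log(withIndex.loaded); // 12
```

Both calls return the same 12 partitions; only the number of partitions touched along the way differs.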

The keys of a partition index must be a subset of the partition keys of the table. You can have a
maximum of 3 partition indexes per table. To specify a partition index, you can use the `partitionIndexes`
property:

```ts
declare const myDatabase: glue.Database;
new glue.Table(this, 'MyTable', {
  database: myDatabase,
  tableName: 'my_table',
  columns: [{
    name: 'col1',
    type: glue.Schema.STRING,
  }],
  partitionKeys: [{
    name: 'year',
    type: glue.Schema.SMALL_INT,
  }, {
    name: 'month',
    type: glue.Schema.SMALL_INT,
  }],
  partitionIndexes: [{
    indexName: 'my-index', // optional
    keyNames: ['year'],
  }], // supply up to 3 indexes
  dataFormat: glue.DataFormat.JSON,
});
```

Alternatively, you can call the `addPartitionIndex()` function on a table:

```ts
declare const myTable: glue.Table;
myTable.addPartitionIndex({
  indexName: 'my-index',
  keyNames: ['year'],
});
```
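
The two constraints described in this section (index keys must be a subset of the table's partition keys, and a table can have at most 3 indexes) can be sketched as a standalone check. This is an illustrative model with hypothetical names (`PartitionIndexSpec`, `checkPartitionIndexes`), not the construct's own validation code:

```ts
// Hypothetical, dependency-free model of the documented rules.
interface PartitionIndexSpec {
  indexName?: string;
  keyNames: string[];
}

const MAX_PARTITION_INDEXES = 3;

// Returns a list of human-readable problems; an empty list means the
// requested indexes satisfy both constraints.
function checkPartitionIndexes(partitionKeyNames: string[], indexes: PartitionIndexSpec[]): string[] {
  const problems: string[] = [];
  if (indexes.length > MAX_PARTITION_INDEXES) {
    problems.push(`at most ${MAX_PARTITION_INDEXES} partition indexes are allowed, got ${indexes.length}`);
  }
  indexes.forEach((spec, position) => {
    const unknownKeys = spec.keyNames.filter((key) => partitionKeyNames.indexOf(key) === -1);
    if (unknownKeys.length > 0) {
      problems.push(`index #${position} uses keys that are not partition keys: ${unknownKeys.join(', ')}`);
    }
  });
  return problems;
}

const tablePartitionKeys = ['year', 'month'];
const okProblems = checkPartitionIndexes(tablePartitionKeys, [{ indexName: 'my-index', keyNames: ['year'] }]);
const badProblems = checkPartitionIndexes(tablePartitionKeys, [{ keyNames: ['year', 'day'] }]);

console.log(okProblems.length); // 0
console.log(badProblems);
```

Here `day` is rejected because the table is only partitioned by `year` and `month`.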

## [Encryption](https://docs.aws.amazon.com/athena/latest/ug/encryption.html)

You can enable encryption on a Table's data:
132 changes: 130 additions & 2 deletions packages/@aws-cdk/aws-glue/lib/table.ts
@@ -1,13 +1,33 @@
import * as iam from '@aws-cdk/aws-iam';
import * as kms from '@aws-cdk/aws-kms';
import * as s3 from '@aws-cdk/aws-s3';
-import { ArnFormat, Fn, IResource, Resource, Stack } from '@aws-cdk/core';
+import { ArnFormat, Fn, IResource, Names, Resource, Stack } from '@aws-cdk/core';
import * as cr from '@aws-cdk/custom-resources';
import { AwsCustomResource } from '@aws-cdk/custom-resources';
import { Construct } from 'constructs';
import { DataFormat } from './data-format';
import { IDatabase } from './database';
import { CfnTable } from './glue.generated';
import { Column } from './schema';

/**
 * Properties of a Partition Index.
 */
export interface PartitionIndex {
  /**
   * The name of the partition index.
   *
   * @default - a name will be generated for you.
   */
  readonly indexName?: string;

  /**
   * The partition key names that comprise the partition
   * index. Each name must match the name of one of the
   * table's partition keys.
   */
  readonly keyNames: string[];
}

export interface ITable extends IResource {
  /**
   * @attribute
@@ -102,7 +122,16 @@ export interface TableProps {
   *
   * @default table is not partitioned
   */
-  readonly partitionKeys?: Column[]
+  readonly partitionKeys?: Column[];

  /**
   * Partition indexes on the table. A maximum of 3 indexes
   * are allowed on a table. Keys in the index must be part
   * of the table's partition keys.
   *
   * @default table has no partition indexes
   */
  readonly partitionIndexes?: PartitionIndex[];

  /**
   * Storage type of the table's data.
@@ -230,6 +259,18 @@ export class Table extends Resource implements ITable {
   */
  public readonly partitionKeys?: Column[];

  /**
   * This table's partition indexes.
   */
  public readonly partitionIndexes?: PartitionIndex[];

  /**
   * Partition indexes must be created one at a time. To avoid
   * race conditions, we store the resource and add dependencies
   * each time a new partition index is created.
   */
  private partitionIndexCustomResources: AwsCustomResource[] = [];

  constructor(scope: Construct, id: string, props: TableProps) {
    super(scope, id, {
      physicalName: props.tableName,
@@ -287,6 +328,77 @@ export class Table extends Resource implements ITable {
      resourceName: `${this.database.databaseName}/${this.tableName}`,
    });
    this.node.defaultChild = tableResource;

    // Partition index creation relies on created table.
    if (props.partitionIndexes) {
      this.partitionIndexes = props.partitionIndexes;
      this.partitionIndexes.forEach((index) => this.addPartitionIndex(index));
    }
  }

  /**
   * Add a partition index to the table. A table can have at most 3
   * partition indexes. Partition index keys must be a subset of the
   * table's partition keys.
   *
   * @see https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html
   */
  public addPartitionIndex(index: PartitionIndex) {
    const numPartitions = this.partitionIndexCustomResources.length;
    if (numPartitions >= 3) {
      throw new Error('Maximum number of partition indexes allowed is 3');
    }
    this.validatePartitionIndex(index);

    const indexName = index.indexName ?? this.generateIndexName(index.keyNames);
    const partitionIndexCustomResource = new cr.AwsCustomResource(this, `partition-index-${indexName}`, {
      onCreate: {
        service: 'Glue',
        action: 'createPartitionIndex',
        parameters: {
          DatabaseName: this.database.databaseName,
          TableName: this.tableName,
          PartitionIndex: {
            IndexName: indexName,
            Keys: index.keyNames,
          },
        },
        physicalResourceId: cr.PhysicalResourceId.of(
          indexName,
        ),
      },
      policy: cr.AwsCustomResourcePolicy.fromSdkCalls({
        resources: cr.AwsCustomResourcePolicy.ANY_RESOURCE,
      }),
    });
    this.grantToUnderlyingResources(partitionIndexCustomResource, ['glue:UpdateTable']);

    // Depend on previous partition index if possible, to avoid race condition
    if (numPartitions > 0) {
      this.partitionIndexCustomResources[numPartitions-1].node.addDependency(partitionIndexCustomResource);
    }
    this.partitionIndexCustomResources.push(partitionIndexCustomResource);
  }

  private generateIndexName(keys: string[]): string {
    const prefix = keys.join('-') + '-';
    const uniqueId = Names.uniqueId(this);
    const maxIndexLength = 80; // arbitrarily specified
    const startIndex = Math.max(0, uniqueId.length - (maxIndexLength - prefix.length));
    return prefix + uniqueId.substring(startIndex);
  }
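
To make the truncation arithmetic in `generateIndexName` easier to follow, here is a standalone sketch of the same computation. The `uniqueId` parameter stands in for the value of `Names.uniqueId(this)`, the function name `sketchIndexName` is hypothetical, and the sample IDs are invented:

```ts
// Standalone copy of the arithmetic; `uniqueId` is passed in instead of
// being read from the construct tree.
function sketchIndexName(keys: string[], uniqueId: string): string {
  const prefix = keys.join('-') + '-';
  const maxIndexLength = 80;
  // Keep only the tail of the unique ID so that prefix + tail fits in 80 characters.
  const startIndex = Math.max(0, uniqueId.length - (maxIndexLength - prefix.length));
  return prefix + uniqueId.substring(startIndex);
}

const shortId = 'MyStackMyTable1A2B3C4D'; // invented, 22 characters
const longId = new Array(91).join('a') + '0123456789'; // invented, 100 characters

const shortName = sketchIndexName(['year'], shortId);
const longName = sketchIndexName(['year'], longId);

console.log(shortName); // year-MyStackMyTable1A2B3C4D
console.log(longName.length); // 80
```

The tail of the ID is kept rather than the head, presumably because the part that makes the ID unique sits at the end. Note that the prefix itself is never shortened, so a long enough list of key names would still produce a name over 80 characters.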

  private validatePartitionIndex(index: PartitionIndex) {
    if (index.indexName !== undefined && (index.indexName.length < 1 || index.indexName.length > 255)) {
      throw new Error(`Index name must be between 1 and 255 characters, but got ${index.indexName.length}`);
    }
    if (!this.partitionKeys || this.partitionKeys.length === 0) {
      throw new Error('The table must have partition keys to create a partition index');
    }
    const keyNames = this.partitionKeys.map(pk => pk.name);
    if (!index.keyNames.every(k => keyNames.includes(k))) {
      throw new Error(`All index keys must also be partition keys. Got ${index.keyNames} but partition key names are ${keyNames}`);
    }
  }

  /**
@@ -336,6 +448,22 @@ export class Table extends Resource implements ITable {
    });
  }

  /**
   * Grant the given identity custom permissions to ALL underlying resources of the table.
   * Permissions will be granted to the catalog, the database, and the table.
   */
  public grantToUnderlyingResources(grantee: iam.IGrantable, actions: string[]) {
    return iam.Grant.addToPrincipal({
      grantee,
      resourceArns: [
        this.tableArn,
        this.database.catalogArn,
        this.database.databaseArn,
Comment on lines +459 to +461
Contributor

Do you have a source for these permissions? What does glue:UpdateTable mean for a table vs catalog vs database?

Contributor Author

This is mostly from my own investigations (nothing documented that I could find).

Here's what I get with no permissions:

Received response status [FAILED] from custom resource. Message returned: User: arn:aws:sts::489318732371:assumed-role/GluePartitionStack-AWS679f53fac002430cb0da5b7982bd-1GZIV4NJQYJJ1/GluePartitionStack-AWS679f53fac002430cb0da5b7982bd-89ZBzvCbg7Oi is not authorized to perform: glue:UpdateTable on resource: arn:aws:glue:us-east-1:489318732371:catalog (RequestId: ab6c2467-9c78-4d23-be59-953e4ab23144)

Here's what I get after I add glue:UpdateTable with permissions to the catalog:

Received response status [FAILED] from custom resource. Message returned: User: arn:aws:sts::489318732371:assumed-role/GluePartitionStack-AWS679f53fac002430cb0da5b7982bd-7U8A8EMIKP4E/GluePartitionStack-AWS679f53fac002430cb0da5b7982bd-nlw0ulljEHZq is not authorized to perform: glue:UpdateTable on resource: arn:aws:glue:us-east-1:489318732371:database/my_database (RequestId: d4caefa8-30d2-4b02-99e4-6b1305ab5aea)

So I then add the same permission to the database and I get:

Received response status [FAILED] from custom resource. Message returned: User: arn:aws:sts::489318732371:assumed-role/GluePartitionStack-AWS679f53fac002430cb0da5b7982bd-1N8TEEWC845G7/GluePartitionStack-AWS679f53fac002430cb0da5b7982bd-z2XSZ4Zjykso is not authorized to perform: glue:UpdateTable on resource: arn:aws:glue:us-east-1:489318732371:table/my_database/json_table (RequestId: aa73a7ff-0ff8-4c20-8338-b0f587298d6e)

So the only combination that works is granting `glue:UpdateTable` on all three: the catalog, the database, and the table.

      ],
      actions,
    });
  }
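
Based on the three error messages in the review thread, the resulting grant can be pictured as one IAM statement that lists all three ARNs. The following is a rough sketch with placeholder values: the account ID is invented, the `aws` partition is hard-coded as an assumption, and `glueTableStatement` and its interface are hypothetical names. The construct itself passes `this.tableArn`, `this.database.catalogArn`, and `this.database.databaseArn`:

```ts
// Hypothetical shape of the policy statement; not a CDK type.
interface GluePolicyStatementSketch {
  Effect: 'Allow';
  Action: string[];
  Resource: string[];
}

// Builds the three ARNs in the same format that appears in the error
// messages: catalog, database/<db>, and table/<db>/<table>.
function glueTableStatement(
  region: string,
  account: string,
  databaseName: string,
  tableName: string,
  actions: string[],
): GluePolicyStatementSketch {
  const arnPrefix = `arn:aws:glue:${region}:${account}`; // assumes the standard `aws` partition
  return {
    Effect: 'Allow',
    Action: actions,
    Resource: [
      `${arnPrefix}:table/${databaseName}/${tableName}`,
      `${arnPrefix}:catalog`,
      `${arnPrefix}:database/${databaseName}`,
    ],
  };
}

const sketchStatement = glueTableStatement('us-east-1', '123456789012', 'my_database', 'json_table', ['glue:UpdateTable']);
console.log(JSON.stringify(sketchStatement, null, 2));
```

Leaving out any one of the three resources corresponds to one of the `not authorized` failures quoted in the thread.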

  private getS3PrefixForGrant() {
    return this.s3Prefix + '*';
  }
2 changes: 2 additions & 0 deletions packages/@aws-cdk/aws-glue/package.json
@@ -99,6 +99,7 @@
"@aws-cdk/aws-s3": "0.0.0",
"@aws-cdk/aws-s3-assets": "0.0.0",
"@aws-cdk/core": "0.0.0",
"@aws-cdk/custom-resources": "0.0.0",
"constructs": "^3.3.69"
},
"homepage": "https://github.com/aws/aws-cdk",
@@ -113,6 +114,7 @@
"@aws-cdk/aws-s3": "0.0.0",
"@aws-cdk/aws-s3-assets": "0.0.0",
"@aws-cdk/core": "0.0.0",
"@aws-cdk/custom-resources": "0.0.0",
"constructs": "^3.3.69"
},
"engines": {