feat(glue): add L2 resources for Database and Table #1988

Merged on Mar 14, 2019 (25 commits)

Commits
3551d75
Add glue database and table
Mar 8, 2019
3de3b03
Add unit tests for database and schema
Mar 9, 2019
e4df159
Stash
Mar 9, 2019
502b0cb
Improve test coverage of table
Mar 11, 2019
a08dc37
Add integration tests and README
Mar 11, 2019
f4db178
Update README with types
Mar 11, 2019
0d10989
Add validation for name uniqueness and at least one column
Mar 11, 2019
cf6d56d
Use strongly named references
Mar 11, 2019
9dc8e57
Update StorageType enums to be enum-like classes
Mar 11, 2019
c7f62d7
Add SSE-S3 and SSE-KMS encryption support
Mar 12, 2019
e0043a9
Restrict s3 grants to only objects containing the table's prefix
Mar 12, 2019
231c36a
Add tsdocs for Type
Mar 12, 2019
cee2e46
Add Encryption to README
Mar 12, 2019
4595fde
Add CSE encryption and distinguish SSE-KMS from SSE-KMS-MANAGED
Mar 12, 2019
b8f886b
Minor fixes to the README
Mar 12, 2019
db1960b
Some more minor fixes to the README
Mar 12, 2019
d245b1c
Merge branch 'master' into samgood/glue
Mar 13, 2019
f98d5fd
Rename prefix to s3Prefix and use haveResource in tests
Mar 13, 2019
d3ccd53
Use string concatentation
Mar 13, 2019
538cff9
Add docs and fix string concatenation
Mar 13, 2019
0c2f7fa
Rename StorageType to DataFormat
Mar 13, 2019
1cb7c4f
Improve docs and make the TableEncryption enum more consistent with B…
Mar 13, 2019
a5d45f0
Refactor s3 bucket creation into separate function and support unencr…
Mar 13, 2019
30f8a3c
add test for CSE-KMS with an explicit bucket
Mar 13, 2019
45434fe
minor fixes to README
Mar 13, 2019
185 changes: 185 additions & 0 deletions packages/@aws-cdk/aws-glue/README.md
@@ -1,2 +1,187 @@
## The CDK Construct Library for AWS Glue
This module is part of the [AWS Cloud Development Kit](https://github.com/awslabs/aws-cdk) project.

### Database

A `Database` is a logical grouping of `Tables` in the Glue Catalog.

```ts
new glue.Database(stack, 'MyDatabase', {
databaseName: 'my_database'
});
```

By default, an S3 bucket is created and the Database is stored under `s3://<bucket-name>/`, but you can manually specify another location:

```ts
new glue.Database(stack, 'MyDatabase', {
databaseName: 'my_database',
locationUri: 's3://explicit-bucket/some-path/'
});
```

### Table

A Glue table describes a table of data in S3: its structure (column names and types), the location of its data (S3 objects with a common prefix in an S3 bucket), and the format of its files (JSON, Avro, Parquet, etc.):

```ts
new glue.Table(stack, 'MyTable', {
database: myDatabase,
tableName: 'my_table',
columns: [{
name: 'col1',
type: glue.Schema.string,
}, {
name: 'col2',
type: glue.Schema.array(glue.Schema.string),
comment: 'col2 is an array of strings' // comment is optional
}],
dataFormat: glue.DataFormat.Json
});
```
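
The commit list mentions validation for name uniqueness and at least one column, and the review thread on `columns` explains that ordering is modelled with an array while uniqueness is checked at runtime. A minimal sketch of such a check follows; the `Column` interface, the function name, and the error messages are assumptions for illustration, not the library's actual code.

```typescript
// Hypothetical column shape, mirroring the `name`/`type` objects in the README.
interface Column {
  name: string;
  type: string;
  comment?: string;
}

// Throws if there are no columns or if two columns share a name.
// The array itself is left untouched, so the declared column order is preserved.
function validateColumns(columns: Column[]): void {
  if (columns.length === 0) {
    throw new Error('a table must have at least one column');
  }
  const seen = new Set<string>();
  for (const column of columns) {
    if (seen.has(column.name)) {
      throw new Error(`column names must be unique, but '${column.name}' is duplicated`);
    }
    seen.add(column.name);
  }
}
```

Failing fast here means every consumer, regardless of language, gets the same error instead of a silently reordered or deduplicated schema.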

By default, an S3 bucket is created to store the table's data, but you can pass your own `bucket` and `s3Prefix`:

```ts
new glue.Table(stack, 'MyTable', {
bucket: myBucket,
s3Prefix: 'my-table/'
...
});
```

#### Partitions

To improve query performance, a table can specify `partitionKeys` on which data is stored and queried separately. For example, you might partition a table by `year` and `month` to optimize queries based on a time window:

```ts
new glue.Table(stack, 'MyTable', {
database: myDatabase,
tableName: 'my_table',
columns: [{
// Review thread on `columns`:
//
// Contributor: I am wondering, if name is unique, why not use a hash?
//
// sam-goodwin (Contributor Author, Mar 11, 2019): There are two semantics we
// want to model as strictly as we can: column uniqueness and ordering.
//   • A hash models uniqueness well, but it does not model ordering. In
//     Node.js, the order of properties is the order in which they are added to
//     the object, but that is not the case for other languages like Java,
//     where a developer would have to know to use a LinkedHashMap.
//   • An array explicitly and intuitively defines the ordering in all
//     languages, but it doesn't model column uniqueness.
// I chose to statically model the ordering property with an array and check
// the uniqueness at runtime because then at least the experience is
// consistent for all consumers. Using a hash might create confusion for
// consumers: they would not receive an error, and the layout of their columns
// could just change arbitrarily.
name: 'col1',
type: glue.Schema.string
}],
partitionKeys: [{
name: 'year',
type: glue.Schema.smallint
}, {
name: 'month',
type: glue.Schema.smallint
}],
dataFormat: glue.DataFormat.Json
});
```
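
To make "stored and queried separately" concrete, the sketch below builds the S3 prefix for one partition. It assumes the Hive-style `key=value` layout that is commonly used for partitioned data in S3; the README does not state the layout, and `partitionPrefix` is a hypothetical helper, not part of the module.

```typescript
// Hypothetical helper: builds a Hive-style S3 prefix for one partition.
// Segments are emitted in the order given, which should match `partitionKeys`.
function partitionPrefix(
  tablePrefix: string,
  partition: Array<{ key: string; value: string | number }>
): string {
  const segments = partition.map(p => `${p.key}=${p.value}/`);
  return tablePrefix + segments.join('');
}

// Objects for March 2019 would then live under this prefix (assumed layout).
const march2019 = partitionPrefix('my-table/', [
  { key: 'year', value: 2019 },
  { key: 'month', value: 3 }
]);
// march2019 === 'my-table/year=2019/month=3/'
```

A query that filters on `year` and `month` only needs to read objects under the matching prefixes, which is where the performance benefit comes from.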

### [Encryption](https://docs.aws.amazon.com/athena/latest/ug/encryption.html)

You can enable encryption on a Table's data:
* `Unencrypted` - files are not encrypted. The default encryption setting.
* [S3Managed](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html) - Server-side encryption (`SSE-S3`) with an Amazon S3-managed key.
```ts
new glue.Table(stack, 'MyTable', {
encryption: glue.TableEncryption.S3Managed
...
});
```
* [Kms](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html) - Server-side encryption (`SSE-KMS`) with an AWS KMS Key managed by the account owner.

```ts
// KMS key is created automatically
new glue.Table(stack, 'MyTable', {
encryption: glue.TableEncryption.Kms
...
});

// with an explicit KMS key
new glue.Table(stack, 'MyTable', {
encryption: glue.TableEncryption.Kms,
encryptionKey: new kms.EncryptionKey(stack, 'MyKey')
...
});
```
* [KmsManaged](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html) - Server-side encryption (`SSE-KMS`), like `Kms`, except with an AWS KMS Key managed by the AWS Key Management Service.
```ts
new glue.Table(stack, 'MyTable', {
encryption: glue.TableEncryption.KmsManaged
...
});
```
* [ClientSideKms](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingClientSideEncryption.html#client-side-encryption-kms-managed-master-key-intro) - Client-side encryption (`CSE-KMS`) with an AWS KMS Key managed by the account owner.
```ts
// KMS key is created automatically
new glue.Table(stack, 'MyTable', {
encryption: glue.TableEncryption.ClientSideKms
...
});

// with an explicit KMS key
new glue.Table(stack, 'MyTable', {
encryption: glue.TableEncryption.ClientSideKms,
encryptionKey: new kms.EncryptionKey(stack, 'MyKey')
...
});
```

*Note: you cannot provide a `Bucket` when creating the `Table` if you wish to use server-side encryption (`Kms`, `KmsManaged` or `S3Managed`)*.

### Types

A table's schema is a collection of columns, each of which has a `name` and a `type`. Types are recursive structures, consisting of primitive and complex types:

```ts
new glue.Table(stack, 'MyTable', {
columns: [{
name: 'primitive_column',
type: glue.Schema.string
}, {
name: 'array_column',
type: glue.Schema.array(glue.Schema.integer),
comment: 'array<integer>'
}, {
name: 'map_column',
type: glue.Schema.map(
glue.Schema.string,
glue.Schema.timestamp),
comment: 'map<string,timestamp>'
}, {
name: 'struct_column',
type: glue.Schema.struct([{
name: 'nested_column',
type: glue.Schema.date,
comment: 'nested comment'
}]),
comment: "struct<nested_column:date COMMENT 'nested comment'>"
}],
...
```

#### Primitive

Numeric:
* `bigint`
* `decimal`
* `float`
* `integer`
* `smallint`
* `tinyint`

Date and Time:
* `date`
* `timestamp`

String:
* `string`
* `char`
* `varchar`

Misc:
* `boolean`
* `binary`

#### Complex

* `array` - array of some other type
* `map` - map of some primitive key type to any value type.
* `struct` - nested structure containing individually named and typed columns.
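
The column comments in the example above (such as `struct<nested_column:date COMMENT 'nested comment'>`) suggest that each type renders recursively to a Hive-style type string. The following is a sketch of that idea under stated assumptions: the `Type` interface, its `inputString` and `isPrimitive` fields, and the helper functions are hypothetical and only mirror the shapes shown in the README, not the module's real implementation.

```typescript
// Hypothetical model of a type: `inputString` is its rendered type string.
interface Type {
  readonly isPrimitive: boolean;
  readonly inputString: string;
}

interface NestedColumn {
  name: string;
  type: Type;
  comment?: string;
}

const primitive = (name: string): Type => ({ isPrimitive: true, inputString: name });

const STRING = primitive('string');
const DATE = primitive('date');
const TIMESTAMP = primitive('timestamp');

// array of some other type
function array(itemType: Type): Type {
  return { isPrimitive: false, inputString: `array<${itemType.inputString}>` };
}

// map of some primitive key type to any value type
function map(keyType: Type, valueType: Type): Type {
  if (!keyType.isPrimitive) {
    throw new Error(`map keys must be primitive, got ${keyType.inputString}`);
  }
  return {
    isPrimitive: false,
    inputString: `map<${keyType.inputString},${valueType.inputString}>`
  };
}

// nested structure containing individually named and typed columns
function struct(columns: NestedColumn[]): Type {
  const fields = columns.map(column => {
    const base = `${column.name}:${column.type.inputString}`;
    return column.comment === undefined ? base : `${base} COMMENT '${column.comment}'`;
  });
  return { isPrimitive: false, inputString: `struct<${fields.join(',')}>` };
}

const nested = struct([{ name: 'nested_column', type: DATE, comment: 'nested comment' }]);
// nested.inputString === "struct<nested_column:date COMMENT 'nested comment'>"
```

Because every helper returns a `Type`, complex types compose freely, for example `array(map(STRING, TIMESTAMP))`.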
83 changes: 83 additions & 0 deletions packages/@aws-cdk/aws-glue/lib/data-format.ts
@@ -0,0 +1,83 @@
/**
* Absolute class name of the Hadoop `InputFormat` to use when reading table files.
*/
export class InputFormat {
/**
* An InputFormat for plain text files. Files are broken into lines. Either linefeed or
* carriage-return are used to signal end of line. Keys are the position in the file, and
* values are the line of text.
*
* @see https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/TextInputFormat.html
*/
public static readonly TextInputFormat = new InputFormat('org.apache.hadoop.mapred.TextInputFormat');

constructor(public readonly className: string) {}
}

/**
* Absolute class name of the Hadoop `OutputFormat` to use when writing table files.
*/
export class OutputFormat {
/**
* Writes text data with a null key (value only).
*
* @see https://hive.apache.org/javadocs/r2.2.0/api/org/apache/hadoop/hive/ql/io/HiveIgnoreKeyTextOutputFormat.html
*/
public static readonly HiveIgnoreKeyTextOutputFormat = new OutputFormat('org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat');

constructor(public readonly className: string) {}
}

/**
* Serialization library to use when serializing/deserializing (SerDe) table records.
*
* @see https://cwiki.apache.org/confluence/display/Hive/SerDe
*/
export class SerializationLibrary {
/**
* @see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-JSON
*/
public static readonly HiveJson = new SerializationLibrary('org.apache.hive.hcatalog.data.JsonSerDe');

/**
* @see https://github.com/rcongiu/Hive-JSON-Serde
*/
public static readonly OpenXJson = new SerializationLibrary('org.openx.data.jsonserde.JsonSerDe');

constructor(public readonly className: string) {}
}

/**
* Defines the input/output formats and ser/de for a single DataFormat.
*/
export interface DataFormat {
/**
* `InputFormat` for this data format.
*/
inputFormat: InputFormat;

/**
* `OutputFormat` for this data format.
*/
outputFormat: OutputFormat;

/**
* Serialization library for this data format.
*/
serializationLibrary: SerializationLibrary;
}

export namespace DataFormat {
/**
* Stored as plain text files in JSON format.
*
* Uses OpenX Json SerDe for serialization and deserialization.
*
* @see https://docs.aws.amazon.com/athena/latest/ug/json.html
*/
export const Json: DataFormat = {
inputFormat: InputFormat.TextInputFormat,
outputFormat: OutputFormat.HiveIgnoreKeyTextOutputFormat,
serializationLibrary: SerializationLibrary.OpenXJson
};
}
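
Because `DataFormat` is a plain interface and the enum-like classes have public constructors, a consumer could in principle assemble a format that is not predefined. The sketch below redeclares minimal stand-ins for the classes above so that it compiles on its own; `hiveJson` and the `com.example.MySerDe` class name are hypothetical and are not part of this change.

```typescript
// Minimal stand-ins for the classes in data-format.ts (redeclared for self-containment).
class InputFormat {
  public static readonly TextInputFormat = new InputFormat('org.apache.hadoop.mapred.TextInputFormat');
  constructor(public readonly className: string) {}
}

class OutputFormat {
  public static readonly HiveIgnoreKeyTextOutputFormat =
    new OutputFormat('org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat');
  constructor(public readonly className: string) {}
}

class SerializationLibrary {
  public static readonly HiveJson = new SerializationLibrary('org.apache.hive.hcatalog.data.JsonSerDe');
  constructor(public readonly className: string) {}
}

interface DataFormat {
  inputFormat: InputFormat;
  outputFormat: OutputFormat;
  serializationLibrary: SerializationLibrary;
}

// Hypothetical variant of the JSON format that swaps in the Hive JSON SerDe.
const hiveJson: DataFormat = {
  inputFormat: InputFormat.TextInputFormat,
  outputFormat: OutputFormat.HiveIgnoreKeyTextOutputFormat,
  serializationLibrary: SerializationLibrary.HiveJson
};

// A SerDe that has no predefined constant can be wrapped by class name.
const customSerde = new SerializationLibrary('com.example.MySerDe'); // hypothetical class name
```

This is the practical benefit of enum-like classes over a closed `enum`: known values are discoverable as static members, while unknown ones remain expressible.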