A Terraform module which deploys a Snowplow Postgres Loader application on AWS running on top of EC2. If you want to use a custom AMI for this deployment you will need to ensure it is based on top of Amazon Linux 2.
WARNING: If you are upgrading from module version 0.1.x you will need to run a manual table update - details can be found here. Adjust the ALTER TABLE command to use the schema that your events table is deployed within.
This module by default collects and forwards telemetry information to Snowplow so that we can understand how our applications are being used. No identifying information about your sub-account or account fingerprints is ever forwarded to us - only very simple information about which modules and applications are deployed and active.
If you wish to subscribe to our mailing list for updates to these modules or security advisories please set the user_provided_id variable to include a valid email address which we can reach you at.
To disable telemetry simply set variable telemetry_enabled = false.
For details on what information is collected please see this module: https://github.com/snowplow-devops/terraform-snowplow-telemetry
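The two telemetry settings described above could be applied roughly as follows. This is a fragment rather than a complete configuration: the required inputs are left out, so it will not validate on its own, and the email address is a placeholder assumed for illustration.

```hcl
module "postgres_loader_enriched" {
  source = "snowplow-devops/postgres-loader-kinesis-ec2/aws"

  # Required inputs (name, vpc_id, subnet_ids, in_stream_name, purpose,
  # schema_name, ssh_key_name and the db_* settings) are omitted for brevity.

  # Option A: keep telemetry on and subscribe to updates.
  # The address below is a placeholder, not a real contact.
  user_provided_id = "ops@example.com"

  # Option B: opt out of telemetry entirely.
  # telemetry_enabled = false
}
```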
The Postgres Loader can load both your enriched and bad data into a Postgres database - by default we are using RDS as it affords a simple and cost effective way to get started.
To start loading "enriched" data into Postgres:
module "enriched_stream" {
  source  = "snowplow-devops/kinesis-stream/aws"
  version = "0.2.0"

  name = "enriched-stream"
}

module "pipeline_rds" {
  source  = "snowplow-devops/rds/aws"
  version = "0.2.0"

  name        = "pipeline-rds"
  vpc_id      = var.vpc_id
  subnet_ids  = var.subnet_ids
  db_name     = local.pipeline_db_name
  db_username = local.pipeline_db_username
  db_password = local.pipeline_db_password

  # Note: this exposes your data to the internet - take care to ensure your allowlist is strict enough
  # or provide a way to access the database through the VPC instead
  publicly_accessible     = true
  additional_ip_allowlist = local.pipeline_ip_allowlist
}

module "postgres_loader_enriched" {
  source = "snowplow-devops/postgres-loader-kinesis-ec2/aws"

  name       = "postgres-loader-enriched-server"
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  in_stream_name = module.enriched_stream.name

  # Note: The purpose defines what the input data set should look like
  purpose = "ENRICHED_EVENTS"

  # Note: This schema is created automatically by the VM on launch
  schema_name = "atomic"

  ssh_key_name     = "your-key-name"
  ssh_ip_allowlist = ["0.0.0.0/0"]

  # Linking in the custom Iglu Server here
  custom_iglu_resolvers = [
    {
      name            = "Iglu Server"
      priority        = 0
      uri             = "http://your-iglu-server-endpoint/api"
      api_key         = var.iglu_super_api_key
      vendor_prefixes = []
    }
  ]

  db_sg_id    = module.pipeline_rds.sg_id
  db_host     = module.pipeline_rds.address
  db_port     = module.pipeline_rds.port
  db_name     = local.pipeline_db_name
  db_username = local.pipeline_db_username
  db_password = local.pipeline_db_password
}
To load the "bad" data instead:
module "bad_1_stream" {
  source  = "snowplow-devops/kinesis-stream/aws"
  version = "0.2.0"

  name = "bad-1-stream"
}

module "postgres_loader_bad" {
  source = "snowplow-devops/postgres-loader-kinesis-ec2/aws"

  name       = "postgres-loader-bad-server"
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  in_stream_name = module.bad_1_stream.name

  # Note: The purpose defines what the input data set should look like
  purpose = "JSON"

  # Note: This schema is created automatically by the VM on launch
  schema_name = "atomic_bad"

  ssh_key_name     = "your-key-name"
  ssh_ip_allowlist = ["0.0.0.0/0"]

  # Linking in the custom Iglu Server here
  custom_iglu_resolvers = [
    {
      name            = "Iglu Server"
      priority        = 0
      uri             = "http://your-iglu-server-endpoint/api"
      api_key         = var.iglu_super_api_key
      vendor_prefixes = []
    }
  ]

  db_sg_id    = module.pipeline_rds.sg_id
  db_host     = module.pipeline_rds.address
  db_port     = module.pipeline_rds.port
  db_name     = local.pipeline_db_name
  db_username = local.pipeline_db_username
  db_password = local.pipeline_db_password
}
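The snippets above reference several variables and locals (var.vpc_id, var.subnet_ids, var.iglu_super_api_key and the local.pipeline_* values) without declaring them. A minimal sketch of how they might be declared is shown below. All values are placeholders, and var.pipeline_db_password is a hypothetical variable introduced here so that the password is not hard-coded.

```hcl
variable "vpc_id" {
  description = "The VPC to deploy the pipeline components within"
  type        = string
}

variable "subnet_ids" {
  description = "The subnets to deploy the pipeline components across"
  type        = list(string)
}

variable "iglu_super_api_key" {
  description = "API key for the custom Iglu Server"
  type        = string
  sensitive   = true
}

# Hypothetical variable (not part of the original examples): lets the
# database password be supplied at plan/apply time instead of in code.
variable "pipeline_db_password" {
  description = "Password for the pipeline database user"
  type        = string
  sensitive   = true
}

locals {
  # Placeholder values - replace with your own.
  pipeline_db_name     = "snowplow"
  pipeline_db_username = "snowplow"
  pipeline_db_password = var.pipeline_db_password

  # Documentation-range CIDR used purely as an example.
  pipeline_ip_allowlist = ["203.0.113.10/32"]
}
```

Marking the secrets as sensitive keeps them out of plan output, although they are still written to the Terraform state.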
As you load data into the database it will naturally start to fill up. To handle this seamlessly you can enable storage auto-scaling for RDS by updating the pipeline_rds module snippet as follows:
module "pipeline_rds" {
  source  = "snowplow-devops/rds/aws"
  version = "0.1.4"

  # Note: Enables autoscaling storage up to 100 GB from the default 10 GB
  max_allocated_storage = 100

  name        = "pipeline-rds"
  vpc_id      = var.vpc_id
  subnet_ids  = var.subnet_ids
  db_name     = local.pipeline_db_name
  db_username = local.pipeline_db_username
  db_password = local.pipeline_db_password

  # Note: this exposes your data to the internet - take care to ensure your allowlist is strict enough
  # or provide a way to access the database through the VPC instead
  publicly_accessible     = true
  additional_ip_allowlist = local.pipeline_ip_allowlist
}
Name | Version |
---|---|
terraform | >= 1.0.0 |
aws | >= 3.45.0 |
Name | Version |
---|---|
aws | >= 3.45.0 |
Name | Source | Version |
---|---|---|
instance_type_metrics | snowplow-devops/ec2-instance-type-metrics/aws | 0.1.2 |
kcl_autoscaling | snowplow-devops/dynamodb-autoscaling/aws | 0.2.0 |
tags | snowplow-devops/tags/aws | 0.2.0 |
telemetry | snowplow-devops/telemetry/snowplow | 0.3.0 |
Name | Type |
---|---|
aws_autoscaling_group.asg | resource |
aws_cloudwatch_log_group.log_group | resource |
aws_dynamodb_table.kcl | resource |
aws_iam_instance_profile.instance_profile | resource |
aws_iam_policy.iam_policy | resource |
aws_iam_role.iam_role | resource |
aws_iam_role_policy_attachment.policy_attachment | resource |
aws_launch_configuration.lc | resource |
aws_security_group.sg | resource |
aws_security_group_rule.egress_tcp_443 | resource |
aws_security_group_rule.egress_tcp_80 | resource |
aws_security_group_rule.egress_tcp_server_rds | resource |
aws_security_group_rule.egress_udp_123 | resource |
aws_security_group_rule.ingress_tcp_22 | resource |
aws_security_group_rule.rds_egress_tcp_webserver | resource |
aws_ami.amazon_linux_2 | data source |
aws_caller_identity.current | data source |
aws_region.current | data source |
Name | Description | Type | Default | Required |
---|---|---|---|---|
db_host | The hostname of the database to connect to | string | n/a | yes |
db_name | The name of the database to connect to | string | n/a | yes |
db_password | The password to use to connect to the database | string | n/a | yes |
db_port | The port the database is running on | number | n/a | yes |
db_sg_id | The ID of the RDS security group that sits downstream of the Postgres Loader | string | n/a | yes |
db_username | The username to use to connect to the database | string | n/a | yes |
in_stream_name | The name of the input Kinesis stream that the Postgres Loader will pull data from | string | n/a | yes |
name | A name which will be prepended to the resources created | string | n/a | yes |
purpose | The type of data the loader will be pulling which can be one of ENRICHED_EVENTS or JSON (Note: JSON can be used for loading bad rows) | string | n/a | yes |
schema_name | The database schema to load data into (e.g. atomic or atomic_bad) | string | n/a | yes |
ssh_key_name | The name of the SSH key-pair to attach to all EC2 nodes deployed | string | n/a | yes |
subnet_ids | The list of subnets to deploy the Postgres Loader across | list(string) | n/a | yes |
vpc_id | The VPC to deploy the Postgres Loader within | string | n/a | yes |
amazon_linux_2_ami_id | The AMI ID to use, which must be based off of Amazon Linux 2; by default the latest community version is used | string | "" | no |
associate_public_ip_address | Whether to assign a public IP address to this instance | bool | true | no |
cloudwatch_logs_enabled | Whether application logs should be reported to CloudWatch | bool | true | no |
cloudwatch_logs_retention_days | The length of time in days to retain logs for | number | 7 | no |
custom_iglu_resolvers | The custom Iglu Resolvers that will be used by the Postgres Loader to resolve and validate events | list(object({ | [] | no |
db_max_connections | The maximum number of connections to the backing database | number | 10 | no |
default_iglu_resolvers | The default Iglu Resolvers that will be used by the Postgres Loader to resolve and validate events | list(object({ | [ | no |
iam_permissions_boundary | The permissions boundary ARN to set on IAM roles created | string | "" | no |
in_max_batch_size_checkpoint | The maximum number of events to process before checkpointing progress on the stream | number | 1000 | no |
in_max_batch_wait_checkpoint | The maximum amount of time to wait before checkpointing progress on the stream | string | "10 seconds" | no |
initial_position | Where to start processing the input Kinesis Stream from (TRIM_HORIZON or LATEST) | string | "TRIM_HORIZON" | no |
instance_type | The instance type to use | string | "t3a.micro" | no |
java_opts | Custom JAVA Options | string | "-Dorg.slf4j.simpleLogger.defaultLogLevel=info -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=75" | no |
kcl_read_max_capacity | The maximum READ capacity for the KCL DynamoDB table | number | 10 | no |
kcl_read_min_capacity | The minimum READ capacity for the KCL DynamoDB table | number | 1 | no |
kcl_write_max_capacity | The maximum WRITE capacity for the KCL DynamoDB table | number | 10 | no |
kcl_write_min_capacity | The minimum WRITE capacity for the KCL DynamoDB table | number | 1 | no |
max_size | The maximum number of servers in this server-group | number | 2 | no |
min_size | The minimum number of servers in this server-group | number | 1 | no |
ssh_ip_allowlist | The list of CIDR ranges to allow SSH traffic from | list(any) | [ | no |
tags | The tags to append to this resource | map(string) | {} | no |
telemetry_enabled | Whether or not to send telemetry information back to Snowplow Analytics Ltd | bool | true | no |
user_provided_id | An optional unique identifier to identify the telemetry events emitted by this stack | string | "" | no |
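As an illustration of how some of the optional inputs listed above might be overridden, the fragment below tightens SSH access and adjusts sizing and logging. It is a sketch only: the required inputs are omitted, so it will not validate on its own, and every value shown (instance type, CIDR range, retention period, tags) is an assumption chosen for the example rather than a recommendation.

```hcl
module "postgres_loader_enriched" {
  source = "snowplow-devops/postgres-loader-kinesis-ec2/aws"

  # Required inputs omitted for brevity - see the usage examples.

  # Sizing (example values only)
  instance_type = "t3a.small"
  min_size      = 1
  max_size      = 2

  # Restrict SSH to a single trusted address instead of 0.0.0.0/0.
  # 203.0.113.10/32 is a documentation-range placeholder.
  ssh_ip_allowlist = ["203.0.113.10/32"]

  # Only consume records that arrive after the loader starts
  initial_position = "LATEST"

  # Keep application logs in CloudWatch for two weeks
  cloudwatch_logs_enabled        = true
  cloudwatch_logs_retention_days = 14

  tags = {
    environment = "dev" # hypothetical tag
  }
}
```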
Name | Description |
---|---|
asg_id | ID of the ASG |
asg_name | Name of the ASG |
sg_id | ID of the security group attached to the Postgres Loader servers |
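The module outputs listed above can be re-exported from a root module or passed to other resources. A minimal sketch, assuming the module was instantiated as postgres_loader_enriched as in the usage example (the output names on the left are hypothetical):

```hcl
output "postgres_loader_asg_name" {
  description = "Name of the Auto Scaling group running the Postgres Loader"
  value       = module.postgres_loader_enriched.asg_name
}

output "postgres_loader_sg_id" {
  description = "ID of the security group attached to the Postgres Loader servers"
  value       = module.postgres_loader_enriched.sg_id
}
```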
The Terraform AWS Postgres Loader on EC2 project is Copyright 2021-2023 Snowplow Analytics Ltd.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.