Lattice 2959 archiving #216

Open · wants to merge 10 commits into develop

Conversation

and-carter

General
Adds the capability to export / import data between S3 and Atlas via ShuttleCli.

Archive Methods
There are two ways to archive data.

  1. Chunked by date
  2. The entire table

Chunking by date is preferred, as it organizes data in S3 more transparently. If the table does not have a date field, or if uploading it as a single table is preferable, use the second method: simply omit the start-date argument, and the entire table is uploaded.

Usage

  • Indicate export / import by argument
  • Indicate temporal details by argument
  • Provide connection, source, and destination details in a YAML file
shuttle
        --archive export
        --config /path/to/config/file.yaml
        --start-date 2021-11-10
        --days 5

where

  • archive specifies whether to import or export
  • config is the path to the config file
  • start-date (optional) is the date to begin archiving (inclusive at 00:00)
  • days (optional) is the number of days to archive, beginning at start-date. Defaults to 1
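
For reference, two more invocation sketches built from the same flags (illustrative only; they assume the import direction accepts the same temporal arguments as export):

shuttle
        --archive import
        --config /path/to/config/file.yaml
        --start-date 2021-11-10
        --days 5

shuttle
        --archive export
        --config /path/to/config/file.yaml

The first re-imports the five archived days starting at 2021-11-10; the second omits start-date, so the entire table is exported as a single upload.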

YAML file template:

archiveConfig:
  hikariConfig:
    jdbcUrl: "jdbcUrl"
    username: "user1234"
    password: "***"
    maximumPoolSize: 5
    connectionTimeout: 60000
  s3Bucket: "s3-bucket-name"
  s3Region: "s3-region-name"
  accessKey: "aws-access-key"
  secretKey: "aws-secret-key"
dbName: "org_my_db"
schemaName: "openlattice"
sourceName: "example_table"
destinationName: "example_table" # in most cases leave the same as sourceName
dateField: "column_name" # column used for date chunking; empty string if archiving the entire table
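
For context, the hikariConfig block mirrors standard HikariCP connection-pool properties. The Kotlin sketch below only illustrates what those keys mean in HikariCP terms; the jdbcUrl value is a placeholder, and Shuttle's actual wiring of this block is not shown in this PR.

import com.zaxxer.hikari.HikariConfig
import com.zaxxer.hikari.HikariDataSource
import java.util.Properties

fun main() {
    // Keys match the hikariConfig entries in the YAML template; they are
    // standard HikariCP bean property names, applied via HikariConfig(Properties).
    val props = Properties().apply {
        setProperty("jdbcUrl", "jdbc:postgresql://localhost:5432/org_my_db") // placeholder
        setProperty("username", "user1234")
        setProperty("password", "***")
        setProperty("maximumPoolSize", "5")
        setProperty("connectionTimeout", "60000")
    }
    HikariDataSource(HikariConfig(props)).use { ds ->
        // Simple connectivity check against the configured database.
        ds.connection.use { conn -> println("connected: ${conn.isValid(5)}") }
    }
}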

S3 storage details
Depending on which archive method one chooses, the S3 file names differ slightly.

File name structure when chunked by date:

  • /archive01/$dbName/$schemaName/$destinationName/$destinationName_$date[_part$partNumber]

File name structure when archiving the entire table:

  • /archive01/$dbName/$schemaName/$destinationName[_part$partNumber]
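
For example, with the template values above (dbName org_my_db, schemaName openlattice, destinationName example_table) and --start-date 2021-11-10 --days 2, chunked exports would land at keys along the lines of (assuming dates are rendered in the same YYYY-MM-DD form as the CLI argument):

  • /archive01/org_my_db/openlattice/example_table/example_table_2021-11-10
  • /archive01/org_my_db/openlattice/example_table/example_table_2021-11-11

An unchunked export of the same table would instead be written to /archive01/org_my_db/openlattice/example_table (plus _part2, _part3, and so on if it spans multiple parts).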

Current gotchas

  • Write validation isn't super helpful if data already exists for a given set of parameters
  • Unchunked uploads larger than ~5,000 GB will likely fail. This is due to the pagination that kicks in for very large loads, which the Shuttle archiver does not currently support. Chunking by date should make this a non-issue for the scale of data we handle.
  • If data is archived a second time and there is less data than in the first archival, old, non-overwritten data may still exist. If that happens, importing will return a mix of old and current data. As a workaround, use a new, unique destinationName when re-archiving with less data.
  • Note that the Aurora extension used for archiving stores data in 5-6 GB files called "parts". So if 30 GB of data is uploaded with a destination of foo, the files will appear in S3 as foo, foo_part2, foo_part3, and so on. The Shuttle archiver automatically imports all parts, so this behavior does not affect usage.

Potential Improvements

  • Better validation. Perhaps ensure that file size and/or row count match expectations.
  • Prevent old data from persisting when re-archiving
