Lattice 2959 archiving #216

Open · wants to merge 10 commits into develop

Conversation

and-carter

General
Adds the capability to export / import data between S3 and Atlas via ShuttleCli.

Archive Methods
There are two ways to archive data.

  1. Chunked by date
  2. The entire table

Chunking by date is preferred, as it organizes data in S3 more transparently. If the table does not have a date field, or if uploading it as a single table is preferable, use the second method: simply omit the start-date argument, and the entire table is uploaded.

Usage

  • Indicate export / import by argument
  • Indicate temporal details by argument
  • Provide connection, source, and destination details in a YAML file
shuttle
        --archive export
        --config /path/to/config/file.yaml
        --start-date 2021-11-10
        --days 5

where

  • archive specifies whether to import or export
  • config is the path to the config file
  • start-date (optional) is the date to begin archiving (inclusive at 00:00)
  • days (optional) is the number of days to archive, beginning at start-date. Defaults to 1
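
For reference, two more invocation sketches built from the same flags (illustrative only; they assume the import direction accepts the same temporal arguments as export):

shuttle
        --archive import
        --config /path/to/config/file.yaml
        --start-date 2021-11-10
        --days 5

shuttle
        --archive export
        --config /path/to/config/file.yaml

The first re-imports the five archived days starting at 2021-11-10; the second omits start-date, so the entire table is exported as a single upload.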

YAML file template:

archiveConfig:
  hikariConfig:
    jdbcUrl: "jdbcUrl"
    username: "user1234"
    password: "***"
    maximumPoolSize: 5
    connectionTimeout: 60000
  s3Bucket: "s3-bucket-name"
  s3Region: "s3-region-name"
  accessKey: "aws-access-key"
  secretKey: "aws-secret-key"
dbName: "org_my_db"
schemaName: "openlattice"
sourceName: "example_table"
destinationName: "example_table" # in most cases leave the same as sourceName
dateField: "column_name" # column used for date chunking; empty string if archiving the entire table
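
For context, the hikariConfig block mirrors standard HikariCP connection-pool properties. The Kotlin sketch below only illustrates what those keys mean in HikariCP terms; the jdbcUrl value is a placeholder, and Shuttle's actual wiring of this block is not shown in this PR.

import com.zaxxer.hikari.HikariConfig
import com.zaxxer.hikari.HikariDataSource
import java.util.Properties

fun main() {
    // Keys match the hikariConfig entries in the YAML template; they are
    // standard HikariCP bean property names, applied via HikariConfig(Properties).
    val props = Properties().apply {
        setProperty("jdbcUrl", "jdbc:postgresql://localhost:5432/org_my_db") // placeholder
        setProperty("username", "user1234")
        setProperty("password", "***")
        setProperty("maximumPoolSize", "5")
        setProperty("connectionTimeout", "60000")
    }
    HikariDataSource(HikariConfig(props)).use { ds ->
        // Simple connectivity check against the configured database.
        ds.connection.use { conn -> println("connected: ${conn.isValid(5)}") }
    }
}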

S3 storage details
Depending on which archive method one chooses, the S3 file names differ slightly.

File name structure when chunked by date:

  • /archive01/$dbName/$schemaName/$destinationName/$destinationName_$date[_part$partNumber]

File name structure when archiving the entire table:

  • /archive01/$dbName/$schemaName/$destinationName[_part$partNumber]
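
For example, with the template values above (dbName org_my_db, schemaName openlattice, destinationName example_table) and --start-date 2021-11-10 --days 2, chunked exports would land at keys along the lines of (assuming dates are rendered in the same YYYY-MM-DD form as the CLI argument):

  • /archive01/org_my_db/openlattice/example_table/example_table_2021-11-10
  • /archive01/org_my_db/openlattice/example_table/example_table_2021-11-11

An unchunked export of the same table would instead be written to /archive01/org_my_db/openlattice/example_table (plus _part2, _part3, and so on if it spans multiple parts).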

Current gotchas

  • Write validation isn't super helpful if data already exists for a given set of parameters
  • Unchunked uploads larger than ~5,000 GB will likely fail. This is due to the pagination that kicks in for very large loads, which the Shuttle archiver does not currently support. Chunking by date should make this a non-issue for the scale of data we handle.
  • If data is archived a second time and there is less data than in the first archival, old, non-overwritten data may still exist. If that happens, importing will return a mix of old and current data. As a workaround, use a new, unique destinationName when re-archiving with less data.
  • Note that the Aurora extension used for archiving stores data in 5-6 GB files called "parts". So if 30 GB of data is uploaded with a destination of foo, the files will appear in S3 as foo, foo_part2, foo_part3, and so on. The Shuttle archiver automatically imports all parts, so this behavior does not affect usage.

Potential Improvements

  • Better validation. Perhaps ensure that file size and/or row count match expectations.
  • Prevent old data from persisting when re-archiving
