Kafka Connect S3

This is a kafka-connect sink and source for Amazon S3, but without any dependency on HDFS/hadoop libs or data formats.

Key Features:

Block GZip output - This keeps storage costs low.
Accurately reflects source topic - The original partition and offset of your records are recorded in S3. This allows you to:
Easily read from a specific topic and partition - Index files make reading a particular offset very efficient, so you only have to download the data that you need.
Seek to a date & time - Your bucket will be broken into daily prefixes, which makes it possible to find data that was written around a specific date and time.

Archive Fork

This is a hard fork of the S3 Sink created by DeviantArt and later modified by Spredfast. This repository is currently not public.

Spredfast Fork

This is a hard fork of the S3 Sink created by DeviantArt.

Notable differences:

Requires Java 8+
Requires Kafka 0.10.0+
Supports Binary and Custom Output Formats
Provides a Source for reading data back from S3
Repackaged and built with Gradle

We are very grateful to the DeviantArt team for their original work. We made the decision to hard fork when it became clear that we would be responsible for ongoing maintenance.

Changelog

0.4.0
- BREAKING CHANGE: Changed the way S3Source offsets are stored to accommodate multiple topics in the same day prefix. Not compatible with old offsets.

Usage

Use just like any other Connector: add it to the Connect classpath and configure a task. Read the rest of this document for configuration details.

Important Configuration

Only bytes may be written to S3, so you must configure the Sink & Source connectors to "convert" to bytes.

Worker vs. Connector Settings

There are important settings for your Kafka Connect cluster itself, and important settings for individual connectors. The following puts cluster settings in connect-worker.properties and individual Connector settings in their respective files.

Recommended Worker Configs

connect-worker.properties:

# too many records can overwhelm the poll loop on large topics and will result in
# Connect continuously rebalancing without making progress
consumer.max.poll.records=500
# Flushing to S3 can take some time, so allow for more than the default 5 seconds when shutting down.
task.shutdown.graceful.timeout.ms=30000

0.10.1.0+

Connect 0.10.1.0 introduced the ability to specify converters at the connector level, so you should specify the AlreadyBytesConverter for both the sink and source.

connect-s3/sink.properties:

key.converter=com.spredfast.kafka.connect.s3.AlreadyBytesConverter
value.converter=com.spredfast.kafka.connect.s3.AlreadyBytesConverter

Pre 0.10.1.0

On older Connect versions only the worker key and value converters can be configured, so to get raw bytes for S3 you must either:

Configure your cluster converter to leave records as raw bytes:

connect-worker.properties:

key.converter=com.spredfast.kafka.connect.s3.AlreadyBytesConverter
value.converter=com.spredfast.kafka.connect.s3.AlreadyBytesConverter

Or provide the S3 connector with the same converter as the worker (to reverse the process), e.g.:

connect-worker.properties:

key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

connect-s3-sink/sink.properties:

key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

See the wiki for further details.

Build and Run

You should be able to build this with ./gradlew build. Once the jar is generated in build/libs, include it in CLASSPATH (e.g., export CLASSPATH=.:$CLASSPATH:/fullpath/to/kafka-connect-s3-jar).

Run: bin/connect-standalone.sh example-connect-worker.properties example-connect-s3-sink.properties (from the root directory of the project; make sure Kafka is on your path, or give the full path to Kafka before bin).

S3 File Format

The default file format is a UTF-8, newline delimited text file. This works well for JSON records, but is unsafe for other formats that may contain newlines or any arbitrary byte sequence.
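To make the risk concrete, here is a small Python sketch (not the connector's code; `write_text` and `read_text` are names made up for illustration) that joins records with a newline and splits them back. JSON without embedded newlines round-trips, while a record that happens to contain the byte 0x0A does not.

```python
def write_text(records, delimiter=b"\n"):
    """Concatenate records, appending the delimiter after each one."""
    return b"".join(record + delimiter for record in records)


def read_text(blob, delimiter=b"\n"):
    """Split a blob back into records on the delimiter."""
    parts = blob.split(delimiter)
    # The blob ends with a delimiter, so the last split element is empty.
    return parts[:-1]


json_records = [b'{"id": 1}', b'{"id": 2}']
print(read_text(write_text(json_records)) == json_records)  # True

# The first record contains the delimiter byte (0x0A), so reading it back
# yields three records instead of two.
binary_records = [b"\x00\x0a\xff", b"ok"]
print(read_text(write_text(binary_records)))  # [b'\x00', b'\xff', b'ok']
```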

Binary Records

The binary output encodes values (and optionally keys) with a 4 byte length, followed by the value. Any record can be safely encoded this way.

format=binary
format.include.keys=true

NOTE: It is critical that the format settings in the S3 Source match the setting of the S3 Sink exactly, otherwise keys, values, and record contents will be corrupted.
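The following Python sketch illustrates the length-prefixed idea and why mismatched settings corrupt data. It is not the connector's implementation: the big-endian byte order, the key-before-value layout, and the treatment of a missing key as an empty byte string are all assumptions, and `encode_record` / `decode_records` are hypothetical names.

```python
import struct


def encode_record(value, key=None, include_keys=False):
    """Encode one record as [len][key] (optional) followed by [len][value]."""
    out = b""
    if include_keys:
        key = key or b""  # assumption: a missing key is written as zero bytes
        out += struct.pack(">I", len(key)) + key
    out += struct.pack(">I", len(value)) + value
    return out


def decode_records(blob, include_keys=False):
    """Yield (key, value) tuples; key is None when keys were not written."""
    pos = 0

    def read_chunk():
        nonlocal pos
        (length,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        chunk = blob[pos:pos + length]
        pos += length
        return chunk

    while pos < len(blob):
        key = read_chunk() if include_keys else None
        value = read_chunk()
        yield key, value


blob = encode_record(b"v1", key=b"k1", include_keys=True)

# Matching settings: the record round-trips, even with arbitrary bytes.
print(list(decode_records(blob, include_keys=True)))   # [(b'k1', b'v1')]

# Mismatched settings: the key is misread as a separate value.
print(list(decode_records(blob, include_keys=False)))  # [(None, b'k1'), (None, b'v1')]
```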

Custom Delimiters

The default format is text, with UTF-8 newlines between records. Keys are dropped. The delimiters and inclusion of keys can be customized:

# this line may be omitted
format=text
# default delimiter is a newline
format.value.delimiter=\n
# UTF-8 is the default encoding
format.value.encoding=UTF-16BE
# keys will only be written if a delimiter is specified
format.key.delimiter=\t
format.key.encoding=UTF-16BE

NOTE: Only the delimiter you specify is encoded. The bytes of the records will be written unchanged. The purpose of the config is to match the delimiter to the record encoding.

Charsets Tip: If using UTF-16, specify UTF-16BE or UTF-16LE to avoid including an additional 2-byte BOM for every key and value.
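As a rough illustration of the two notes above, this Python sketch (`write_record` is a hypothetical helper, not part of the connector) encodes only the delimiters and passes the record bytes through untouched. It also shows the BOM effect using Python's codecs, which behave like the generic UTF-16 charset in this respect.

```python
def write_record(value, key=None,
                 value_delimiter="\n", value_encoding="utf-8",
                 key_delimiter=None, key_encoding="utf-8"):
    """Return the bytes for one record; only the delimiters are encoded."""
    out = b""
    if key_delimiter is not None:
        # Keys are only written when a key delimiter is configured.
        out += (key or b"") + key_delimiter.encode(key_encoding)
    out += value + value_delimiter.encode(value_encoding)
    return out


# Records that are already UTF-16BE need UTF-16BE delimiters to stay parseable.
key = "k".encode("utf-16-be")
value = "v".encode("utf-16-be")
print(write_record(value, key=key,
                   value_encoding="utf-16-be",
                   key_delimiter="\t", key_encoding="utf-16-be"))

# The generic "utf-16" codec prepends a 2-byte BOM; the explicit variants do not.
print(len("\n".encode("utf-16")))     # 4
print(len("\n".encode("utf-16-be")))  # 2
```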

Custom Format

A custom S3RecordFormat can be specified by providing the class name:

format=com.mycompany.MyS3RecordFormatImpl
format.prop1=abc

Refer to the S3 Formats wiki for more information.

Configuration

In addition to the standard kafka-connect config options and format settings we support/require the following, in the task properties file or REST config:

Config Key	Default	Notes
s3.bucket	REQUIRED	The name of the bucket to write to.
s3.prefix	`""`	Prefix added to all object keys stored in bucket to "namespace" them.
s3.endpoint	AWS defaults per region	Mostly useful for testing.
s3.path_style	`false`	Force path-style access to bucket rather than subdomain. Mostly useful for tests.
compressed_block_size	67108864	How much uncompressed data to write to the file before we roll over to a new block/chunk. The default is 64 MiB. See the Block GZip feature above.

Note that we use the default AWS SDK credentials provider. Refer to their docs for the options for configuring S3 credentials.
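For the REST route, the settings are sent as a JSON body to the Connect REST API's `POST /connectors` endpoint. The Python sketch below only builds that body. The helper name, the connector name, the topic, and the bucket are placeholders, and the connector class is passed in rather than hardcoded because this document does not state it; copy the real value from example-connect-s3-sink.properties.

```python
import json

BYTES_CONVERTER = "com.spredfast.kafka.connect.s3.AlreadyBytesConverter"


def sink_connector_request(name, connector_class, topics, bucket, prefix=""):
    """Build the JSON body for POST /connectors (all values are strings)."""
    config = {
        "connector.class": connector_class,
        "tasks.max": "1",
        "topics": ",".join(topics),
        "s3.bucket": bucket,
        "s3.prefix": prefix,
        # Connector-level converters require Connect 0.10.1.0 or later.
        "key.converter": BYTES_CONVERTER,
        "value.converter": BYTES_CONVERTER,
        "format": "binary",
        "format.include.keys": "true",
    }
    return {"name": name, "config": config}


request = sink_connector_request(
    name="s3-sink",                                    # placeholder
    connector_class="<sink connector class>",          # placeholder, see lead-in
    topics=["events"],                                 # placeholder
    bucket="my-bucket",                                # placeholder
    prefix="kafka/",
)
print(json.dumps(request, indent=2))
```

The same keys work unchanged in a properties file; the REST form just wraps them in a `name` and `config` object.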

These additional configs apply to the Source connector:

Config Key	Default	Notes
max.poll.records	1000	The number of records to return in a single poll of S3
s3.page.size	100	The number of objects we list from S3 in one request
max.partition.count	200	The maximum number of partitions a topic can have. Partitions over this number will not be processed.
targetTopic.${original}	none	Set this if you want the source to send records to a different topic than the original, e.g., targetTopic.foo=bar would send messages originally in topic foo to topic bar.
s3.start.marker	`null`	List-Object Marker. S3 object key or key prefix to start reading from.
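The targetTopic.${original} lookup can be pictured as a simple map with a fallback to the original topic name. This is a sketch of the described behavior, not the source connector's code, and `target_topic` is a hypothetical helper.

```python
TARGET_TOPIC_PREFIX = "targetTopic."


def target_topic(config, original_topic):
    """Return the remapped topic if configured, else the original topic."""
    return config.get(TARGET_TOPIC_PREFIX + original_topic, original_topic)


source_config = {"targetTopic.foo": "bar"}
print(target_topic(source_config, "foo"))  # bar
print(target_topic(source_config, "baz"))  # baz
```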

Contributing

Pull requests welcome! If you need ideas, check the issues for open enhancements.

Publishing

# publish to nexus
export MAVEN_USERNAME=...
export MAVEN_PASSWORD=...
./gradlew publishShadowPublicationToMavenRepository

# publish to github
export USERNAME=ramanenka
export TOKEN=... # a personal GitHub token with the write:packages scope
./gradlew publishShadowPublicationToGitHubPackagesRepository
