Quick Start

Spark SST Data Source lets you use Spark to decode SST files generated by a RawKV backup into key-value pairs.

Install tikv-client-java

git clone git@github.com:tikv/client-java.git
mvn --file client-java/pom.xml clean install -DskipTests

Build sst-data-source project

git clone git@github.com:tikv/migration.git
cd migration
mvn clean package -DskipTests -am -pl sst-data-source

Export SST

br backup raw \
--pd 127.0.0.1:2379 \
--storage "hdfs:///path/to/sst/" \
--start s \
--end t \
--format raw \
--cf default

Run SSTDataSourceExample

spark-submit \
--master local[*] \
--jars /path/to/tikv-client-java-3.3.0-SNAPSHOT.jar \
--class org.tikv.datasources.sst.example.SSTDataSourceExample \
sst-data-source/target/sst-data-source-0.0.1-SNAPSHOT.jar \
hdfs:///path/to/sst/

Call Spark SST Data Source

You can also write a self-contained application to decode SST files.

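  import org.apache.spark.sql.SparkSession

  def main(args: Array[String]): Unit = {
    // Create (or reuse) a session; the master is supplied by spark-submit.
    val spark = SparkSession.builder().appName("SSTDataSourceExample").getOrCreate()
    val sstFilePath = "hdfs:///path/to/sst/"
    // Load the SST files through the "sst" data source.
    val df = spark.read
      .format("sst")
      .load(sstFilePath)
    df.printSchema()
    // Print the number of decoded key-value pairs.
    println(df.count())
    df.show(false)
    spark.stop()
  }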

The output of df.printSchema() is as follows:

root
 |-- key: binary (nullable = false)
 |-- value: binary (nullable = true)
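
Because both columns are binary, `df.show` prints raw byte arrays that are hard to read. The following is a minimal sketch, not part of this project, that renders the two columns as hex strings using Spark's built-in `hex` function. The object name `SstHexPreview`, the helper `toHex`, and the output column names `key_hex` and `value_hex` are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, hex}

// Hypothetical helper object; the names here are illustrative only.
object SstHexPreview {
  // Render both binary columns as upper-case hex strings for display.
  def toHex(df: DataFrame): DataFrame =
    df.select(hex(col("key")).as("key_hex"), hex(col("value")).as("value_hex"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SstHexPreview").getOrCreate()
    // Placeholder path, as in the example above.
    val df = spark.read.format("sst").load("hdfs:///path/to/sst/")
    // Show the first 10 rows without truncating long values.
    toHex(df).show(10, truncate = false)
    spark.stop()
  }
}
```

If the keys and values are known to hold UTF-8 text, casting the columns to `string` may be more readable than hex, but that depends on how the data was written.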

Parameters

| Key | Default Value | Description |
| --- | --- | --- |
| path | - | The path to the SST files, e.g. hdfs:///path/to/sst/ |
| enable-ttl | false | Whether TTL is enabled on the TiKV cluster |
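
As a rough sketch of how these parameters might be supplied, the example below assumes that `enable-ttl` is passed through Spark's standard `DataFrameReader.options` mechanism and that `path` is the argument to `load`. This README does not state how the options are wired, so treat both as assumptions. The object name `SstReadOptions` and the helper `readerOptions` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object; not part of sst-data-source.
object SstReadOptions {
  // Build the reader options map. Assumes the data source reads
  // "enable-ttl" as a string-valued Spark data source option.
  def readerOptions(enableTtl: Boolean): Map[String, String] =
    Map("enable-ttl" -> enableTtl.toString)

  // Load SST files from `path` with the given TTL setting.
  def load(spark: SparkSession, path: String, enableTtl: Boolean): DataFrame =
    spark.read
      .format("sst")
      .options(readerOptions(enableTtl))
      .load(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SstReadOptions").getOrCreate()
    // Placeholder path; set enableTtl to match the TiKV cluster that produced the backup.
    val df = load(spark, "hdfs:///path/to/sst/", enableTtl = true)
    df.show(false)
    spark.stop()
  }
}
```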

Spark Version

The default Spark version is 3.0.2. To build against a different Spark version, pass it at compile time, for example:

mvn clean package -DskipTests -Dspark.version.compile=3.1.1

Develop

To format the code, please run mvn mvn-scalafmt_2.12:format or mvn clean package -DskipTests.

Documents