Spark SST Data Source lets users decode SST files generated by a RawKV backup into key-value pairs using Spark.
git clone git@github.com:tikv/client-java.git
mvn --file client-java/pom.xml clean install -DskipTests
git clone git@github.com:tikv/migration.git
cd migration
mvn clean package -DskipTests -am -pl sst-data-source
br backup raw \
--pd 127.0.0.1:2379 \
--storage "hdfs:///path/to/sst/" \
--start s \
--end t \
--format raw \
--cf default
spark-submit \
--master local[*] \
--jars /path/to/tikv-client-java-3.3.0-SNAPSHOT.jar \
--class org.tikv.datasources.sst.example.SSTDataSourceExample \
sst-data-source/target/sst-data-source-0.0.1-SNAPSHOT.jar \
hdfs:///path/to/sst/
You can also write a self-contained application to decode SST files.
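import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  // Obtain the Spark session (created by spark-submit, or built here if absent).
  val spark = SparkSession.builder().getOrCreate()
  val sstFilePath = "hdfs:///path/to/sst/"
  // Load the SST files as a DataFrame of (key, value) rows.
  val df = spark.read
    .format("sst")
    .load(sstFilePath)
  df.printSchema()
  df.count()
  df.show(false)
}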
The output of df.printSchema() is as follows:
root
|-- key: binary (nullable = false)
|-- value: binary (nullable = true)
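Because both columns are binary, `df.show` prints raw bytes. The sketch below shows one way to make them readable. It assumes the values happen to be UTF-8 text, which depends entirely on what was written to RawKV, and that the sst-data-source jar is on the classpath. The object name `SSTReadableExample` and the column aliases are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hex}

// Hypothetical example object; not part of the sst-data-source module.
object SSTReadableExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SSTReadableExample").getOrCreate()

    // Load the backup as a DataFrame with binary `key` and `value` columns.
    val df = spark.read.format("sst").load("hdfs:///path/to/sst/")

    val readable = df.select(
      // Hex-encode the key so arbitrary bytes stay printable.
      hex(col("key")).as("key_hex"),
      // Assumption: values are UTF-8 text; the cast decodes the bytes as a string.
      col("value").cast("string").as("value_utf8")
    )

    readable.show(10, truncate = false)
    spark.stop()
  }
}
```

If the values are not text, hex-encoding the value column in the same way is the safer choice.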
| Key | Default Value | Description |
|---|---|---|
| path | - | The path to the SST files, e.g. hdfs:/path/to/sst/ |
| enable-ttl | false | Whether the TiKV cluster enables TTL |
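As a minimal sketch, the options in the table could be set through the standard Spark `DataFrameReader` API. This assumes the data source reads `enable-ttl` from the reader options under exactly that key, and that `path` is supplied through `load`. The object name `SSTWithTtlExample` is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the sst-data-source module.
object SSTWithTtlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SSTWithTtlExample").getOrCreate()

    val df = spark.read
      .format("sst")
      // Assumption: set to "true" only if the TiKV cluster that produced
      // the backup had TTL enabled; the default is false.
      .option("enable-ttl", "true")
      // The `path` option is passed as the argument to load().
      .load("hdfs:///path/to/sst/")

    df.show(false)
    spark.stop()
  }
}
```

The setting should match the cluster that produced the backup, since it tells the reader whether the TiKV cluster had TTL enabled.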
The default Spark version is 3.0.2. To build against a different Spark version, compile with the following command:
mvn clean package -DskipTests -Dspark.version.compile=3.1.1
To format the code, run mvn mvn-scalafmt_2.12:format or mvn clean package -DskipTests.