Spark client export documentation (#1693)
itaiad200 authored Apr 5, 2021
1 parent 5bc45a4 commit 2eb68e3
Showing 2 changed files with 142 additions and 11 deletions.
120 changes: 120 additions & 0 deletions docs/reference/export.md
@@ -0,0 +1,120 @@
---
layout: default
title: Exporting data
description: Use the lakeFS Spark client to export a lakeFS commit to the object store.
parent: Reference
nav_exclude: true
has_children: false
---

# Exporting data
{: .no_toc }
The export operation copies all data from a given lakeFS commit to
a designated object store location.

For instance, the contents of `lakefs://example@master` might be exported to
`s3://company-bucket/example/latest`. Clients entirely unaware of lakeFS could use that
base URL to access the latest files on `master`. Clients aware of lakeFS can continue to use
the lakeFS S3 endpoint to access repository files on `s3://example/master`, as well as
other committed versions and uncommitted changes.
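
For example, a consumer that knows nothing about lakeFS could read the exported copy directly from the object store. The snippet below is a minimal sketch; the `s3a` scheme and the `events/` sub-path are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lakeFS-unaware consumer").getOrCreate()

// Reads the exported copy straight from the bucket -- no lakeFS endpoint involved.
val df = spark.read.parquet("s3a://company-bucket/example/latest/events/")
df.show()
```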

Possible use-cases:
1. External consumers of data don't have access to your lakeFS installation.
1. Some data pipelines in the organization are not fully migrated to lakeFS.
1. You want to experiment with lakeFS as a side-by-side installation first.
1. You want to create copies of your data lake in other regions (taking read pricing into account).

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}

## How to use

Set up the lakeFS Spark metadata client with the endpoint and credentials, as instructed on the [Spark client page](./spark-client.md).

The client exposes the `Exporter` object with three export options:

1. Export *all* objects at the HEAD of a given branch. Files that were added to that
   branch but not committed are not included.

```scala
exportAllFromBranch(branch: String)
```

2. Export *all* objects from a given commit:

```scala
exportAllFromCommit(commitID: String)
```

3. Export just the diff between a commit and the HEAD of a branch.
   This is the ideal option for continuous exports of a branch, as it copies only the files
   that changed since the previous commit.

```scala
exportFrom(branch: String, prevCommitID: String)
```

## Success/Failure Indications
When the Spark export operation ends, an additional status file is added to the root of
the object storage destination.
If all files were exported successfully, the file path will be of the form `EXPORT_<commitID>_SUCCESS`.
On failure, the path will be of the form `EXPORT_<commitID>_FAILURE`, and the file will include a log of the failed file operations.
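
For example, a downstream job can wait for the success marker before consuming the exported data. The following is a minimal sketch using the Hadoop `FileSystem` API; the destination path and commit ID are placeholder assumptions, and it assumes the object store credentials are already available in the Hadoop configuration.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder values -- substitute your export destination and commit ID.
val rootLocation = "s3a://company-bucket/example/latest"
val commitID = "f3c450d8cd0e84ac67e7bc1c5dcde9bef82d8ba7"

val fs = FileSystem.get(new URI(rootLocation), new Configuration())

// True if the export finished without errors; on failure, the corresponding
// EXPORT_<commitID>_FAILURE file contains a log of the failed file operations.
val exportSucceeded = fs.exists(new Path(s"$rootLocation/EXPORT_${commitID}_SUCCESS"))
```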

## Export Rounds (Spark success files)
Some files should be exported before others; for example, a Spark `_SUCCESS` file exported before other files under
the same prefix might give consumers the wrong indication.

To handle this, a single export may be split into several *rounds*.
A failing round stops the export of all files in subsequent rounds.

By default, lakeFS uses the `SparkFilter`, which splits each export into 2 rounds:
the first round exports all files except Spark `_SUCCESS` files, and the second round exports the Spark `_SUCCESS` files.
Users may override this default behaviour by passing a custom `filter` to the `Exporter`.
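
For illustration, a custom filter might delay manifest-style files to a later round so they appear only after the data they describe. The trait and method names below (`KeyFilter`, `roundForKey`) are assumptions about the client's filter interface -- check the lakeFS Spark client sources for the exact signature.

```scala
// Hypothetical sketch of a custom filter (interface names are assumed):
// round 1 exports regular data files, round 2 exports marker/manifest files.
class ManifestLastFilter extends KeyFilter {
  override def roundForKey(key: String): Int =
    if (key.endsWith("_SUCCESS") || key.endsWith(".manifest")) 2 else 1
}

// Such a filter could then be passed to the Exporter in place of the default SparkFilter.
```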

## Example

1. First configure the `Exporter` instance:

```scala
import io.treeverse.clients.{ApiClient, Exporter}
import org.apache.spark.sql.SparkSession

val endpoint = "http://<LAKEFS_ENDPOINT>/api/v1"
val accessKey = "<LAKEFS_ACCESS_KEY_ID>"
val secretKey = "<LAKEFS_SECRET_ACCESS_KEY>"

val repo = "example-repo"

val spark = SparkSession.builder().appName("I can export").master("local").getOrCreate()
val sc = spark.sparkContext
sc.hadoopConfiguration.set("lakefs.api.url", endpoint)
sc.hadoopConfiguration.set("lakefs.api.access_key", accessKey)
sc.hadoopConfiguration.set("lakefs.api.secret_key", secretKey)

// Add any required Spark context configuration for S3
val rootLocation = "s3://company-bucket/example/latest"

val apiClient = new ApiClient(endpoint, accessKey, secretKey)
val exporter = new Exporter(spark, apiClient, repo, rootLocation)

```

a. Now you can export all objects from the `main` branch to `s3://company-bucket/example/latest`:

```scala
val branch = "main"
exporter.exportAllFromBranch(branch)
```

b. Assuming a previous successful export at commit `f3c450d8cd0e84ac67e7bc1c5dcde9bef82d8ba7`,
you can alternatively export just the difference between the `main` branch and that commit:

```scala
val branch = "main"
val commit = "f3c450d8cd0e84ac67e7bc1c5dcde9bef82d8ba7"
exporter.exportFrom(branch, commit)
```
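
c. You can also export *all* objects as of a specific commit, reusing the commit ID from the previous step:

```scala
val commit = "f3c450d8cd0e84ac67e7bc1c5dcde9bef82d8ba7"
exporter.exportAllFromCommit(commit)
```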

33 changes: 22 additions & 11 deletions clients/spark/README.md → docs/reference/spark-client.md
@@ -1,3 +1,12 @@
---
layout: default
title: Spark Client
description: The lakeFS Spark client performs operations on committed lakeFS metadata stored in the object store.
parent: Reference
nav_exclude: true
has_children: false
---


# lakeFS Spark Client

@@ -32,19 +41,20 @@ Alternatively, the Jars are publicly available on S3:

1. To read metadata from lakeFS, the client should be configured with your lakeFS endpoint and credentials, using the following Hadoop configurations:

| Configuration                        | Description                                                  |
|--------------------------------------|--------------------------------------------------------------|
| `spark.hadoop.lakefs.api.url`        | lakeFS API endpoint, e.g: `http://lakefs.example.com/api/v1` |
| `spark.hadoop.lakefs.api.access_key` | The access key to use for fetching metadata from lakeFS      |
| `spark.hadoop.lakefs.api.secret_key` | Corresponding lakeFS secret key                              |

1. The client will also directly interact with your storage using Hadoop FileSystem. Therefore, your Spark session must be able to access the underlying storage of your lakeFS repository.
For instance, running as a user with a personal account on S3 (not in production) you might add the following (a sketch combining both sets of configurations follows this list):

| Configuration                    | Description                                               |
|----------------------------------|-----------------------------------------------------------|
| `spark.hadoop.fs.s3a.access.key` | Access key to use for accessing underlying storage on S3  |
| `spark.hadoop.fs.s3a.secret.key` | Corresponding secret key to use with S3 access key        |

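For example, both sets of configurations can be provided when building the Spark session (or equivalently as `--conf` flags to `spark-submit`). This is a minimal sketch with placeholder values:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder endpoint and keys -- substitute your own values.
val spark = SparkSession.builder()
  .appName("lakeFS metadata client")
  .config("spark.hadoop.lakefs.api.url", "http://lakefs.example.com/api/v1")
  .config("spark.hadoop.lakefs.api.access_key", "<LAKEFS_ACCESS_KEY_ID>")
  .config("spark.hadoop.lakefs.api.secret_key", "<LAKEFS_SECRET_ACCESS_KEY>")
  .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
  .config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
  .getOrCreate()
```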

## Examples
@@ -84,3 +94,4 @@ For instance, running as a user with a personal account on S3 (not in production
*/
```

