
Docs: Better roadmap structure (#2179)
ozkatz authored Jul 1, 2021
1 parent febc95e commit efb8af6
Showing 1 changed file with 90 additions and 67 deletions.
157 changes: 90 additions & 67 deletions docs/understand/roadmap.md
@@ -2,7 +2,7 @@
layout: default
title: Roadmap
parent: Understanding lakeFS
description: New features and improvements are lined up next for lakeFS. We would love you to be part of building lakeFS’s roadmap.
description: New features and improvements are lined-up next for lakeFS. We would love you to be part of building lakeFS’s roadmap.
nav_order: 40
has_children: false
redirect_from: ../roadmap.html
@@ -17,105 +17,128 @@ redirect_from: ../roadmap.html
1. TOC
{:toc}

---

## Architecture
## Use Case: Development Environment

_TL;DR_ - After receiving feedback on early versions of lakeFS, project **["lakeFS on the Rocks"](https://docs.google.com/document/d/1jzD7-jun-tdU5BGapmnMBe9ovSzBvTNjXCcVztV07A4/edit?usp=sharing){:target="_blank"}** represents a set of changes to the architecture and data model of lakeFS. The main motivators are simplicity, reduced barriers of entry, scalability - and the added benefit of having lakeFS adhere more closely to Git in semantics and UX.
{: .note .note-warning }
### Ephemeral branches with a TTL
Throwaway development or experimentation branches that live for a pre-configured amount of time, and are cleaned up afterwards. This is especially useful when running automated tests or when experimenting with new technologies, code or algorithms. We want to see what the outcome looks like, but don’t really need the output to live much longer than the duration of the experiment.
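
Since branch TTLs don't exist yet, here is a rough client-side sketch of the idea, assuming the existing branch-creation, branch-listing and branch-deletion API endpoints on a local lakeFS server; the expiry bookkeeping (encoded in the branch name) is purely hypothetical.

```python
# Rough client-side sketch only: lakeFS does not expose a branch TTL yet.
# Assumes a lakeFS server at http://localhost:8000 with basic-auth credentials;
# the TTL bookkeeping (expiry encoded in the branch name) is hypothetical.
import time
import requests

LAKEFS = "http://localhost:8000/api/v1"
AUTH = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")
TTL_SECONDS = 24 * 60 * 60  # keep experiment branches around for one day

def create_ephemeral_branch(repo: str, source: str, name: str) -> str:
    expiry = int(time.time()) + TTL_SECONDS
    branch = f"exp-{name}-expires-{expiry}"  # encode the expiry in the branch name
    resp = requests.post(
        f"{LAKEFS}/repositories/{repo}/branches",
        json={"name": branch, "source": source},
        auth=AUTH,
    )
    resp.raise_for_status()
    return branch

def cleanup_expired_branches(repo: str) -> None:
    # Periodic job: delete experiment branches whose encoded expiry has passed
    # (pagination omitted for brevity).
    resp = requests.get(f"{LAKEFS}/repositories/{repo}/branches", auth=AUTH)
    resp.raise_for_status()
    for ref in resp.json().get("results", []):
        branch_id = ref["id"]
        if branch_id.startswith("exp-") and "-expires-" in branch_id:
            if int(branch_id.rsplit("-expires-", 1)[1]) < time.time():
                requests.delete(
                    f"{LAKEFS}/repositories/{repo}/branches/{branch_id}", auth=AUTH
                ).raise_for_status()
```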

There are 3 big shifts in design:
[Track and discuss on GitHub](https://github.com/treeverse/lakeFS/issues/2180){: target="_blank" class="btn" }


1. Using the underlying object store as a source of truth for all committed data. We do this by storing commits as RocksDB SSTable files, where each commit is a “snapshot” of a given repository, split across multiple SSTable files that could be reused across commits.
1. Removal of PostgreSQL as a dependency: scaling it for very high throughput while keeping it predictable in performance for different loads and access patterns has a very high operational cost.
1. Extract metadata operations into a separate service (with the S3 API Gateway remaining as a stateless client for this service). Would allow for the development of “native” clients for big data tools that don’t require passing the data itself through lakeFS, but rather talk directly to the underlying object store.
---

The change will be made gradually over (at least) 3 releases:
## Use Case: Continuous Integration

### ~~lakeFS On The Rocks: Milestone #1 - Committed Metadata Extraction~~ Done
{: .no_toc }
The initial release of the new lakeFS design will include the following changes:
1. Commit metadata stored on S3 in [SSTable format](https://blog.lowentropy.info/topics/deep-into-rocksdb/sstable-format-blockbased)
1. Uncommitted entries will be stored in PostgreSQL
1. Refs (branches, tags) will be stored in PostgreSQL
### Repo linking
The ability to explicitly depend on data residing in another repository. While it is possible to state these cross-links by sticking them in the repo's commit metadata, we think a more explicit and structured approach would be valuable. Stating our dependencies in something that resembles a [pom.xml](https://maven.apache.org/guides/introduction/introduction-to-the-pom.html#:~:text=A%20Project%20Object%20Model%20or,default%20values%20for%20most%20projects.) or [go.mod](https://github.com/golang/go/wiki/Modules#gomod) file would allow us to support better CI and CD integrations that ensure reproducibility without vendoring or copying data.
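
As a purely illustrative sketch (no such manifest format exists today), a go.mod-style pin of upstream data might look like the following; the repository names, refs and helper are hypothetical.

```python
# Purely illustrative: neither the manifest format nor the helper below exists
# in lakeFS today. The idea is to pin upstream data the way go.mod pins code.
from typing import Dict

# Hypothetical "data.mod"-style manifest, pinning other repositories to refs.
DATA_DEPENDENCIES: Dict[str, Dict[str, str]] = {
    "customer-events": {"repository": "events-lake", "ref": "a1b2c3d4"},  # commit ID
    "reference-data":  {"repository": "ref-data",    "ref": "v2021.06"},  # tag
}

def resolve(dep_name: str, path: str) -> str:
    """Turn a pinned dependency plus a path into a fully qualified lakeFS URI."""
    dep = DATA_DEPENDENCIES[dep_name]
    return f"lakefs://{dep['repository']}/{dep['ref']}/{path}"

# e.g. resolve("customer-events", "raw/2021/06/") ->
#   "lakefs://events-lake/a1b2c3d4/raw/2021/06/"
```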

This release doesn't change the way lakeFS is deployed or operates - but it represents a major change in the data model (moving from the legacy MVCC data model to a Merkle tree structure, which brings lakeFS much closer to Git).
[Track and discuss on GitHub](https://github.com/treeverse/lakeFS/issues/1771){: target="_blank" class="btn" }

While there's still a strong dependency on PostgreSQL, the schema and access patterns are much simpler, resulting in improved performance and reduced operational overhead.
### Git-lakeFS integration
The ability to connect Git commits with lakeFS commits.
This is especially useful for reproducibility: by looking at a set of changes to the **data**, users can reference (or even run) the job that produced them.
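
Until a native integration exists, one hedged way to approximate this is a convention of attaching the Git SHA as lakeFS commit metadata through the existing commit endpoint; the metadata keys and repository URL below are assumptions, not a defined standard.

```python
# Sketch of a stop-gap convention, not the planned integration itself: attach
# the Git commit that produced the data as lakeFS commit metadata, using the
# existing commit endpoint. The metadata keys are arbitrary/assumed.
import subprocess
import requests

LAKEFS = "http://localhost:8000/api/v1"
AUTH = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")

def commit_with_git_provenance(repo: str, branch: str, message: str) -> dict:
    git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    resp = requests.post(
        f"{LAKEFS}/repositories/{repo}/branches/{branch}/commits",
        json={
            "message": message,
            "metadata": {  # free-form key/value metadata on the lakeFS commit
                "git_commit_sha": git_sha,
                "git_repo_url": "https://github.com/example/data-jobs",  # assumed
            },
        },
        auth=AUTH,
    )
    resp.raise_for_status()
    return resp.json()
```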

### lakeFS On The Rocks: Milestone #2 - Metadata Service
Extracting metadata into its own service:
1. [gRPC](https://grpc.io/) service that exposes metadata operations
1. S3 gateway API and OpenAPI servers become stateless and act as clients for the metadata service
1. Native [Hadoop Filesystem](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html) implementation on top of the metadata service, with [Apache Spark](https://spark.apache.org/) support (depends on Java native SDK for metadata server below)
[Track and discuss on GitHub](https://github.com/treeverse/lakeFS/issues/2073){: target="_blank" class="btn" }

### lakeFS On The Rocks: Milestone #3 - Remove PostgreSQL
Removing PostgreSQL for uncommitted data and refs, moving to [Raft](https://raft.github.io/):
1. Turn the metadata server into a Raft consensus group, managing state in [RocksDB](https://rocksdb.org/)
1. Remove dependency on PostgreSQL
1. Raft snapshots stored natively on underlying object stores

This release will mark the completion of project **["lakeFS on the Rocks"](https://docs.google.com/document/d/1jzD7-jun-tdU5BGapmnMBe9ovSzBvTNjXCcVztV07A4/edit?usp=sharing){:target="_blank"}**
### Protected Branches
A way to ensure certain branches (e.g., main) are only merged to and are never written to directly. In combination with [Webhook Support](../setup/hooks.md), this allows users to provide a set of quality guarantees about a given branch (e.g., reading from main ensures the schema never breaks and all partitions are complete and tested).

## Operations
[Track and discuss on GitHub](https://github.com/treeverse/lakeFS/issues/2181){: target="_blank" class="btn" }

---

### Kubernetes operator for gateways and Metadata servers
We see Kubernetes as a first-class deployment target for lakeFS. While deploying the stateless components such as the S3 Gateway and the OpenAPI server is relatively easy, deploying the metadata service, which is a stateful Raft group, is a little more involved. Design is still pending while we learn more about best practices (for example, the [Consul Kubernetes operator](https://www.consul.io/docs/k8s/installation/install#architecture)).
## Use Case: Continuous Deployment

### Metadata operations security and access model
Reduce the operational overhead of managing access control: Currently operators working with both lakeFS and the native object store are required to manage a similar set of access controls for both.
Moving to a federated access control model using the object store’s native access control facilities (e.g. [IAM](https://aws.amazon.com/iam/)) will help reduce this overhead. This requires more discovery around the different use cases to help design something coherent. If you’re using lakeFS and have strong opinions about access control, please reach out on Slack.
### Native Metastore Integration <span>High Priority</span>{: .label }

### lakeFS hard delete
Deletion in lakeFS is a metadata operation: objects referenced from commits remain available and are therefore never removed from the underlying storage.
To support hard deletion, lakeFS will supply two operations (roughly illustrated after this list):
1. Garbage Collection - allow users to specify how long objects remain available after they are deleted; for more information, [see this GitHub issue](https://github.com/treeverse/lakeFS/issues/1932)
2. Object Lifecycle - allow users to define a lifecycle configuration on lakeFS paths, enabling hard deletion based on path prefixes; for more information, [see this GitHub issue](https://github.com/treeverse/lakeFS/issues/1566)
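
Both operations are still open issues, so the following is only a hypothetical illustration of how prefix-based retention rules might be expressed and evaluated; none of these names exist in lakeFS today.

```python
# Entirely hypothetical: both features above are open issues, so this only
# illustrates how prefix-based retention rules might be expressed and checked.
from datetime import datetime, timedelta, timezone

RETENTION_RULES = [
    {"prefix": "raw/events/", "retain_deleted_for": timedelta(days=14)},
    {"prefix": "tmp/",        "retain_deleted_for": timedelta(days=1)},
]

def is_eligible_for_hard_delete(path: str, deleted_at: datetime) -> bool:
    """Return True if a deleted object's retention window has passed."""
    now = datetime.now(timezone.utc)
    for rule in RETENTION_RULES:
        if path.startswith(rule["prefix"]):
            return now - deleted_at > rule["retain_deleted_for"]
    return False  # no matching rule: keep the underlying object

# e.g. is_eligible_for_hard_delete("tmp/scratch.parquet",
#                                  datetime(2021, 6, 1, tzinfo=timezone.utc))
```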
Create a robust connection between a Hive Metastore and lakeFS.
Ideally, metastore representations of tables managed in lakeFS should be versioned in the same way.

## Clients
This will allow users to move across different branches or commits for both data and metadata, so that querying from a certain commit will always produce the same results.

Additionally, for CD use cases, it will allow a merge operation to introduce Hive table changes (schema evolution, partition addition/removal) atomically alongside the change to the data itself - as well as track those changes with the same set of commits - a lakeFS diff will show both metadata and data changes.

### Java native SDK for metadata server
Will be the basis of a [Hadoop Filesystem](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html) implementation. It will allow compute engines to access the data directly without proxying it through lakeFS, improving operational efficiency and scalability.
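
For illustration only, reading through such a filesystem from PySpark might look roughly like this; the `lakefs://` scheme and the configuration keys are assumptions about a client that does not exist yet.

```python
# Assumes the planned Hadoop Filesystem client is on the Spark classpath and
# registers a "lakefs://" scheme - none of which exists yet; illustration only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-directly-from-object-store-via-lakefs-metadata")
    # Hypothetical configuration keys for the future filesystem client:
    .config("spark.hadoop.fs.lakefs.endpoint", "http://localhost:8000/api/v1")
    .config("spark.hadoop.fs.lakefs.access.key", "ACCESS_KEY_ID")
    .config("spark.hadoop.fs.lakefs.secret.key", "SECRET_ACCESS_KEY")
    .getOrCreate()
)

# Metadata is resolved through lakeFS; object data is read straight from S3.
df = spark.read.parquet("lakefs://example-repo/main/collections/events/")
df.show(10)
```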
[Track and discuss on GitHub](https://github.com/treeverse/lakeFS/issues/1846){: target="_blank" class="btn" }

### Python SDK for metadata server
In order to support automated data CI/CD integration with pipeline management and orchestration tools such as [Apache Airflow](https://airflow.apache.org/) and [Luigi](https://luigi.readthedocs.io/en/stable/) (see [Continuous Deployment](#use-case-continuous-deployment) below), a Python SDK for lakeFS metadata API is required. It will be used by the libraries native to each orchestration tool.
### Webhooks: merged/committed snapshot <span>High Priority</span>{: .label }

## Use Case: Development Environment
Currently, pre-merge hooks run before a merged view is even available (and before conflicts are detected).
Ideally, pre-merge hooks should actually run after the logical merge operation has completed: the hook should be handed a commit to run on; otherwise, modeling data validation tests is very hard.
Once all hooks pass, the branch is then moved to point at the new merge commit, completing the operation.

### Ephemeral branches with a TTL
Throwaway development or experimentation branches that live for a pre-configured amount of time, and are cleaned up afterwards. This is especially useful when running automated tests or when experimenting with new technologies, code or algorithms. We want to see what the outcome looks like, but don’t really need the output to live much longer than the duration of the experiment.
This change will also allow running hooks without holding a branch lock, moving to an optimistic concurrency model. This is required in order to support long-running operations such as Spark jobs, table scans, etc.
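
A minimal sketch of the kind of receiver such a hook could call once it is handed a merged commit; the payload fields and the "2xx means pass" contract are assumptions about the eventual design.

```python
# Minimal sketch of a pre-merge hook receiver. The payload fields
# ("repository", "commit_id") and the "2xx means pass" contract are
# assumptions about how the finished feature might behave.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/webhooks/pre-merge", methods=["POST"])
def pre_merge():
    event = request.get_json(force=True)
    repo = event.get("repository")
    merged_commit = event.get("commit_id")  # ideally: the already-merged snapshot

    # Run validations against the merged snapshot (schema checks, row counts...).
    problems = run_data_validations(repo, merged_commit)

    if problems:
        # Non-2xx response -> lakeFS aborts the merge and leaves the branch as-is.
        return jsonify({"passed": False, "problems": problems}), 412
    return jsonify({"passed": True}), 200

def run_data_validations(repo, commit_id):
    """Placeholder for real checks run against the merged commit's data."""
    return []

if __name__ == "__main__":
    app.run(port=8080)
```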

[Track and discuss on GitHub](https://github.com/treeverse/lakeFS/issues/1742){: target="_blank" class="btn" }

## Use Case: Continuous Integration
### Webhook Support integration: Metastore registration
Using webhooks, we can automatically register or update collections in a Hive/Glue metastore using [Symlink Generation](../integrations/glue_hive_metastore.md#create-symlink). This will also allow systems that don't natively integrate with lakeFS to consume data produced with lakeFS.

### Repo linking
The ability to explicitly depend on data residing in another repository. While it is possible to state these cross-links by sticking them in the repo's commit metadata, we think a more explicit and structured approach would be valuable. Stating our dependencies in something that resembles a [pom.xml](https://maven.apache.org/guides/introduction/introduction-to-the-pom.html#:~:text=A%20Project%20Object%20Model%20or,default%20values%20for%20most%20projects.) or [go.mod](https://github.com/golang/go/wiki/Modules#gomod) file would allow us to support better CI and CD integrations that ensure reproducibility without vendoring or copying data.
[Track and discuss on GitHub](https://github.com/treeverse/lakeFS/issues/2182){: target="_blank" class="btn" }

### Webhook Support
Being able to have pre-defined code execute before a commit or merge operation - potentially preventing that action from taking place. This allows lakeFS users to codify best practices (format and schema enforcement before merging to main) as well as run tools such as [Great Expectations](https://greatexpectations.io/) or [Monte Carlo](https://www.montecarlodata.com/) **before** data ever reaches consumers.
### Webhook Alerting
Support integration with existing alerting systems, triggered when a webhook returns a failure. This is useful, for example, when a data quality test fails: the new data is not merged into main due to the quality issue, and the owning team is alerted.

### Protected Branches
A way to ensure certain branches (e.g., main) are only merged to and are never written to directly. In combination with [Webhook Support](#webhook-support) (see above), this allows users to provide a set of quality guarantees about a given branch (e.g., reading from main ensures the schema never breaks and all partitions are complete and tested).
[Track and discuss on GitHub](https://github.com/treeverse/lakeFS/issues/2183){: target="_blank" class="btn" }

### Webhook Support integration: Metastore registration
Using webhooks, we can automatically register or update collections in a Hive/Glue metastore using [Symlink Generation](../integrations/glue_hive_metastore.md#create-symlink). This will also allow systems that don't natively integrate with lakeFS to consume data produced with lakeFS.
---

### Webhook Support integration: Metadata validation
Provide a basic wrapper around something like [pyArrow](https://pypi.org/project/pyarrow/) that validates Parquet or ORC files for common schema problems such as backwards incompatibility.
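
A rough sketch of the kind of check such a wrapper could run, using pyArrow to compare a newly written Parquet file against the current production schema; the file paths are placeholders.

```python
# Rough sketch of a backward-compatibility check using pyarrow; paths are
# placeholders for the "production" and "candidate" versions of a table.
import pyarrow.parquet as pq

def breaking_changes(old_file: str, new_file: str) -> list:
    """Detect columns that were dropped or re-typed between two Parquet files."""
    old_schema = pq.read_schema(old_file)
    new_schema = pq.read_schema(new_file)
    problems = []
    for name in old_schema.names:
        if name not in new_schema.names:
            problems.append(f"column dropped: {name}")
        elif new_schema.field(name).type != old_schema.field(name).type:
            problems.append(
                f"column re-typed: {name} "
                f"({old_schema.field(name).type} -> {new_schema.field(name).type})"
            )
    return problems  # empty list means backward compatible (new columns are fine)

# e.g. breaking_changes("main/events.parquet", "feature-branch/events.parquet")
```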

## Use Case: Continuous Deployment
## Architecture & Operations

### Airflow Operators
Provide a set of reusable building blocks for Airflow that can create branches, commit and merge. The idea here is to enhance existing pipelines that, for example, run a series of Spark jobs, with an easy way to create a lakeFS branch before starting, passing that branch as a parameter to all Spark jobs, and upon successful execution, commit and merge their output to main.
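
A sketch of the branch-per-run pattern these building blocks would enable; the three `lakefs_*` callables are placeholders, since the operators themselves don't exist yet.

```python
# Sketch of the branch-per-run pattern the operators would enable. The three
# lakefs_* helpers are placeholders - the real operators do not exist yet.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def lakefs_create_branch(**_): ...    # would call the lakeFS API to branch off main
def run_spark_jobs(**_): ...          # existing pipeline steps, pointed at the branch
def lakefs_commit_and_merge(**_): ... # would commit the branch and merge it into main

with DAG(
    dag_id="daily_etl_with_lakefs",
    start_date=datetime(2021, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create_branch = PythonOperator(task_id="create_branch", python_callable=lakefs_create_branch)
    spark_jobs = PythonOperator(task_id="spark_jobs", python_callable=run_spark_jobs)
    commit_merge = PythonOperator(task_id="commit_and_merge", python_callable=lakefs_commit_and_merge)

    create_branch >> spark_jobs >> commit_merge
```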
_TL;DR_ - After receiving feedback on early versions of lakeFS, project **["lakeFS on the Rocks"](https://docs.google.com/document/d/1jzD7-jun-tdU5BGapmnMBe9ovSzBvTNjXCcVztV07A4/edit?usp=sharing){:target="_blank"}** represents a set of changes to the architecture and data model of lakeFS. The main motivators are simplicity, reduced barriers of entry, scalability - and the added benefit of having lakeFS adhere more closely to Git in semantics and UX.
{: .note .note-warning }

[Track and discuss on GitHub](https://github.com/treeverse/lakeFS/issues/1771){: target="_blank" class="btn" }
There are 3 big shifts in design:

### Webhook Support integration: Data Quality testing
Provide a webhook around a tool such as [Great Expectations](https://greatexpectations.io/) that runs data quality tests before merging into a main branch.
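
For illustration, such a webhook could execute something like the following, using the classic Great Expectations Dataset API; the table path, the expectations, and the exact shape of the result objects are assumptions that depend on the Great Expectations version.

```python
# Rough sketch of what such a webhook could execute, using the classic
# Great Expectations Dataset API; the path and expectations are examples,
# and the result-object shape depends on the installed GE version.
import pandas as pd
from great_expectations.dataset import PandasDataset

def quality_gate(parquet_path: str) -> bool:
    """Return True only if the new data meets the expectations below."""
    df = PandasDataset(pd.read_parquet(parquet_path))
    results = [
        df.expect_column_values_to_not_be_null("event_id"),
        df.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000),
    ]
    return all(bool(r.success) for r in results)

# A pre-merge webhook would call quality_gate(...) on the merged snapshot and
# fail the merge when it returns False.
```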

1. ~~Using the underlying object store as a source of truth for all committed data. We do this by storing commits as RocksDB SSTable files, where each commit is a "snapshot" of a given repository, split across multiple SSTable files that could be reused across commits.~~
1. ~~Expose metadata operations as part of the OpenAPI gateway to allow other client types (e.g., Hadoop FileSystem) besides the S3 gateway interface~~
1. Implement a pluggable ref-store that allows storing references not (only) on PostgreSQL

### Webhook Alerting
Support integration with existing alerting systems, triggered when a webhook returns a failure. This is useful, for example, when a data quality test fails: the new data is not merged into main due to the quality issue, and the owning team is alerted.
### Decouple ref-store from PostgreSQL

Currently, lakeFS requires a PostgreSQL database. Internally, it is used to store references (branches, tags, etc.) and other metadata such as user management.

Making this store a pluggable component would allow the following (a minimal interface sketch follows this list):

1. Simpler quickstart using **only an object store**: allow running lakeFS without any dependencies. This ref-store will use the underlying object store to also store the references. For S3 (or any object store that doesn't support any native transaction/compare-and-swap semantics) this will be available only when running in single-instance mode. This is still beneficial for running lakeFS in POC or development mode, removing the need to run and connect multiple Docker containers.
1. Flexible production setup: A PostgreSQL option will still be available, but additional implementations will also be possible: running lakeFS as a Raft consensus group, using other RDBMS types such as MySQL, or using managed services such as DynamoDB that lakeFS will be able to manage itself
1. Easier scalability: Scaling RDBMS for very high throughput while keeping it predictable in performance for different loads and access patterns has a very high operational cost.
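
As a minimal sketch of what "pluggable" could mean here (lakeFS itself is written in Go; the method set below is assumed for illustration, not an actual interface):

```python
# Minimal sketch of the kind of contract a pluggable ref-store implies
# (lakeFS itself is written in Go; this method set is assumed, not real).
from abc import ABC, abstractmethod
from typing import Optional

class RefStore(ABC):
    """Stores branch/tag pointers; implementations could be PostgreSQL, Raft,
    or the underlying object store itself."""

    @abstractmethod
    def get(self, repo: str, ref: str) -> Optional[str]:
        """Return the commit ID a ref points to, or None if it does not exist."""

    @abstractmethod
    def compare_and_set(self, repo: str, ref: str,
                        expected: Optional[str], new: str) -> bool:
        """Atomically move a ref from `expected` to `new`; return False on conflict.
        On stores without native CAS (plain S3), this is only safe single-instance."""

    @abstractmethod
    def delete(self, repo: str, ref: str) -> None:
        """Remove a ref."""
```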

This release will mark the completion of project **["lakeFS on the Rocks"](https://docs.google.com/document/d/1jzD7-jun-tdU5BGapmnMBe9ovSzBvTNjXCcVztV07A4/edit?usp=sharing){:target="_blank"}**

[Track and discuss on GitHub](https://github.com/treeverse/lakeFS/pull/1685){: target="_blank" class="btn" }


### Snowflake Support <span>Requires Discussion</span>{: .label .label-yellow }

TBD - We don't yet have concrete plans on how to handle Snowflake (and potentially other Data Warehouse/Database sources).
If you'd like to have data in Snowflake managed by lakeFS, with full branching/merging/CI/CD capabilities, please contact us!

[Contact us, we'd love to talk about it!](mailto:hello@treeverse.io?subject=using+lakeFS+with+Snowflake){: target="_blank" class="btn" }


### Metadata operations security and access model <span>High Priority</span>{: .label }
Reduce the operational overhead of managing access control: Currently operators working with both lakeFS and the native object store are required to manage a similar set of access controls for both.
Moving to a federated access control model using the object store’s native access control facilities (e.g. [IAM](https://aws.amazon.com/iam/)) will help reduce this overhead. This requires more discovery around the different use cases to help design something coherent. If you’re using lakeFS and have strong opinions about access control, please reach out on Slack.

[Track and discuss on GitHub](https://github.com/treeverse/lakeFS/issues/2184){: target="_blank" class="btn" }


### Native Spark OutputCommitter

Provide a Spark OutputCommitter that actually... commits.
This allows creating atomic Spark writes that are automatically tracked in lakeFS as commits.

Each job will use its native job ID as (part of) a branch name for isolation, with the Output Committer doing a commit and merge operation to the requested branch on success (a rough sketch of this flow follows the list below). This has several benefits:

- Performance: This committer does metadata operations only, and doesn't rely on copying data
- Atomicity: A commit in lakeFS is guaranteed to either succeed or fail, but will not leave any intermediate state on failure.
- Allows incorporating simple hooks into the Spark job: users can define a webhook that runs before such a merge completes successfully
- Traceability: Attaching metadata to each commit means we get quite a lot of information on where data is coming from, how it's generated, etc. This allows building reproducible pipelines in an easier way.
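
The flow the committer would automate, sketched with plain HTTP calls against the lakeFS API as it exists today (endpoint paths as I understand them); the committer itself and its Spark wiring are omitted.

```python
# The branch-per-job commit/merge flow the committer would automate, sketched
# with plain HTTP calls; the committer itself does not exist yet.
import requests

LAKEFS = "http://localhost:8000/api/v1"
AUTH = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")

def committed_write(repo: str, destination: str, job_id: str, write_fn) -> None:
    branch = f"job-{job_id}"  # isolation: one branch per Spark job
    requests.post(f"{LAKEFS}/repositories/{repo}/branches",
                  json={"name": branch, "source": destination}, auth=AUTH).raise_for_status()
    try:
        write_fn(f"lakefs://{repo}/{branch}/")  # the job writes its output into the branch
        requests.post(f"{LAKEFS}/repositories/{repo}/branches/{branch}/commits",
                      json={"message": f"output of job {job_id}",
                            "metadata": {"job_id": job_id}}, auth=AUTH).raise_for_status()
        # Merge is atomic: readers of `destination` see all of the output or none of it.
        requests.post(f"{LAKEFS}/repositories/{repo}/refs/{branch}/merge/{destination}",
                      json={}, auth=AUTH).raise_for_status()
    except Exception:
        # On failure, simply drop the branch; `destination` is left untouched.
        requests.delete(f"{LAKEFS}/repositories/{repo}/branches/{branch}", auth=AUTH)
        raise
```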

[Track and discuss on GitHub](https://github.com/treeverse/lakeFS/issues/2042){: target="_blank" class="btn" }
