From f48f66700fad8793b3f84a6b8855210f386ce0da Mon Sep 17 00:00:00 2001 From: johnnyaug Date: Tue, 18 May 2021 14:18:30 +0300 Subject: [PATCH] Docs/makeover (#1961) * Delete community.md * docs quick wins * -- * -- * Update data-devenv.md * Update ci.md * Update data-devenv.md * Update cd.md * Delete recommendations.md * -- * -- * -- * -- * Update index.md * Update index.md * Update data-devenv.md * Update ci.md * Update cd.md * Update hooks.md * Update azure.md * Update gcp.md * Update k8s.md * Update index.md * Update index.md * Update index.md * Update index.md * Update blob.md * Update s3.md * Update gcs.md * Update minio.md * Update branching-model.md * -- * Update try.md * Update databricks.md * Update athena.md * Update aws_cli.md * Update glue_etl.md * Update index.md * Update sizing-guide.md * Update sizing-guide.md * Update data-model.md * Update licensing.md * -- * redirect from old links * trigger tests * remove duplication * add installation options * delete unused * add docker * add docker Co-authored-by: YaelRiv <67264175+YaelRiv@users.noreply.github.com> --- cmd/lakectl/cmd/docs.go | 2 +- docs/_config.yml | 2 + docs/architecture/index.md | 7 - docs/branching/index.md | 7 - docs/branching/recommendations.md | 164 ------------------ docs/community.md | 23 --- docs/deploy/aws.md | 116 +++++++++++++ docs/deploy/azure.md | 103 +++++++++++ docs/deploy/docker.md | 70 ++++++++ docs/deploy/gcp.md | 103 +++++++++++ .../includes}/aws-docker-config.md | 0 .../includes}/aws-docker-run.md | 0 .../includes}/aws-helm-values.md | 0 .../includes}/azure-docker-config.md | 0 .../includes}/azure-docker-run.md | 0 .../includes}/azure-helm-values.md | 0 .../includes}/gcp-docker-config.md | 0 .../includes}/gcp-docker-run.md | 0 .../includes}/gcp-helm-values.md | 0 docs/deploy/includes/prerequisites.md | 10 ++ docs/deploy/includes/why-dns.md | 12 ++ docs/deploy/index.md | 13 ++ docs/deploy/k8s.md | 85 +++++++++ docs/deploying-aws/bucket.md | 102 ----------- docs/deploying-aws/db.md | 58 ------- docs/deploying-aws/index.md | 11 -- docs/deploying-aws/install.md | 159 ----------------- docs/deploying-aws/lb_dns.md | 41 ----- docs/downloads.md | 25 --- .../model.md => guides/branching-model.md} | 11 +- docs/{ => guides}/hooks.md | 5 +- docs/{reference => guides}/import-mvcc.md | 4 +- docs/{reference => guides}/import.md | 11 +- docs/guides/index.md | 7 + docs/{deploying-aws => guides}/setup.md | 14 +- docs/index.md | 14 +- docs/{using => integrations}/airflow.md | 3 +- docs/{using => integrations}/athena.md | 5 +- docs/{using => integrations}/aws_cli.md | 5 +- docs/{using => integrations}/boto.md | 3 +- docs/{using => integrations}/databricks.md | 5 +- docs/{using => integrations}/distcp.md | 3 +- docs/{using => integrations}/emr.md | 3 +- docs/{using => integrations}/glue_etl.md | 5 +- .../glue_hive_metastore.md | 3 +- docs/{using => integrations}/hive.md | 3 +- docs/integrations/index.md | 7 + docs/{using => integrations}/kakfa.md | 3 +- docs/{using => integrations}/mapreduce.md | 3 +- docs/{using => integrations}/minio.md | 3 +- docs/{using => integrations}/presto.md | 3 +- docs/{using => integrations}/python.md | 3 +- docs/{using => integrations}/rclone.md | 3 +- docs/{using => integrations}/sagemaker.md | 3 +- docs/{using => integrations}/spark.md | 3 +- docs/quickstart/lakefs_cli.md | 4 +- docs/quickstart/more_quickstart_options.md | 2 +- docs/quickstart/repository.md | 6 +- docs/quickstart/try.md | 2 +- docs/reference/authorization.md | 4 +- docs/reference/commands.md | 7 +- docs/{deploying-aws 
=> reference}/monitor.md | 7 +- .../offboarding.md | 5 +- docs/{deploying-aws => reference}/upgrade.md | 3 +- docs/storage/blob.md | 30 ++++ docs/storage/gcs.md | 16 ++ docs/storage/index.md | 17 ++ docs/storage/s3.md | 61 +++++++ .../architecture.md} | 11 +- .../data-model.md | 9 +- docs/understand/index.md | 9 + docs/{ => understand}/licensing.md | 8 +- docs/{ => understand}/roadmap.md | 6 +- .../sizing-guide.md | 16 +- docs/usecases/cd.md | 57 ++++++ docs/usecases/ci.md | 27 +++ docs/usecases/data-devenv.md | 85 +++++++++ docs/usecases/index.md | 9 + docs/using/index.md | 7 - 79 files changed, 971 insertions(+), 685 deletions(-) delete mode 100644 docs/architecture/index.md delete mode 100644 docs/branching/index.md delete mode 100644 docs/branching/recommendations.md delete mode 100644 docs/community.md create mode 100644 docs/deploy/aws.md create mode 100644 docs/deploy/azure.md create mode 100644 docs/deploy/docker.md create mode 100644 docs/deploy/gcp.md rename docs/{deploying-aws/installation-methods => deploy/includes}/aws-docker-config.md (100%) rename docs/{deploying-aws/installation-methods => deploy/includes}/aws-docker-run.md (100%) rename docs/{deploying-aws/installation-methods => deploy/includes}/aws-helm-values.md (100%) rename docs/{deploying-aws/installation-methods => deploy/includes}/azure-docker-config.md (100%) rename docs/{deploying-aws/installation-methods => deploy/includes}/azure-docker-run.md (100%) rename docs/{deploying-aws/installation-methods => deploy/includes}/azure-helm-values.md (100%) rename docs/{deploying-aws/installation-methods => deploy/includes}/gcp-docker-config.md (100%) rename docs/{deploying-aws/installation-methods => deploy/includes}/gcp-docker-run.md (100%) rename docs/{deploying-aws/installation-methods => deploy/includes}/gcp-helm-values.md (100%) create mode 100644 docs/deploy/includes/prerequisites.md create mode 100644 docs/deploy/includes/why-dns.md create mode 100644 docs/deploy/index.md create mode 100644 docs/deploy/k8s.md delete mode 100644 docs/deploying-aws/bucket.md delete mode 100644 docs/deploying-aws/db.md delete mode 100644 docs/deploying-aws/index.md delete mode 100644 docs/deploying-aws/install.md delete mode 100644 docs/deploying-aws/lb_dns.md delete mode 100644 docs/downloads.md rename docs/{branching/model.md => guides/branching-model.md} (93%) rename docs/{ => guides}/hooks.md (98%) rename docs/{reference => guides}/import-mvcc.md (95%) rename docs/{reference => guides}/import.md (96%) create mode 100644 docs/guides/index.md rename docs/{deploying-aws => guides}/setup.md (71%) rename docs/{using => integrations}/airflow.md (96%) rename docs/{using => integrations}/athena.md (82%) rename docs/{using => integrations}/aws_cli.md (95%) rename docs/{using => integrations}/boto.md (97%) rename docs/{using => integrations}/databricks.md (95%) rename docs/{using => integrations}/distcp.md (98%) rename docs/{using => integrations}/emr.md (98%) rename docs/{using => integrations}/glue_etl.md (78%) rename docs/{using => integrations}/glue_hive_metastore.md (99%) rename docs/{using => integrations}/hive.md (97%) create mode 100644 docs/integrations/index.md rename docs/{using => integrations}/kakfa.md (94%) rename docs/{using => integrations}/mapreduce.md (80%) rename docs/{using => integrations}/minio.md (95%) rename docs/{using => integrations}/presto.md (98%) rename docs/{using => integrations}/python.md (98%) rename docs/{using => integrations}/rclone.md (97%) rename docs/{using => integrations}/sagemaker.md (97%) rename 
docs/{using => integrations}/spark.md (98%) rename docs/{deploying-aws => reference}/monitor.md (90%) rename docs/{deploying-aws => reference}/offboarding.md (83%) rename docs/{deploying-aws => reference}/upgrade.md (97%) create mode 100644 docs/storage/blob.md create mode 100644 docs/storage/gcs.md create mode 100644 docs/storage/index.md create mode 100644 docs/storage/s3.md rename docs/{architecture/overview.md => understand/architecture.md} (92%) rename docs/{architecture => understand}/data-model.md (92%) create mode 100644 docs/understand/index.md rename docs/{ => understand}/licensing.md (95%) rename docs/{ => understand}/roadmap.md (96%) rename docs/{architecture => understand}/sizing-guide.md (97%) create mode 100644 docs/usecases/cd.md create mode 100644 docs/usecases/ci.md create mode 100644 docs/usecases/data-devenv.md create mode 100644 docs/usecases/index.md delete mode 100644 docs/using/index.md diff --git a/cmd/lakectl/cmd/docs.go b/cmd/lakectl/cmd/docs.go index d2acc9b7e67..257444b2b4b 100644 --- a/cmd/lakectl/cmd/docs.go +++ b/cmd/lakectl/cmd/docs.go @@ -32,7 +32,7 @@ has_children: false ` + "`" + `lakectl` + "`" + ` is distributed as a single binary, with no external dependencies - and is available for MacOS, Windows and Linux. -[Download lakectl](../downloads.md){: .btn .btn-green target="_blank"} +[Download lakectl](../index.md#downloads){: .btn .btn-green target="_blank"} ### Configuring credentials and API endpoint diff --git a/docs/_config.yml b/docs/_config.yml index 2009147e87c..adc1dfccaf0 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -66,3 +66,5 @@ image: '/assets/img/shared-image.png' plugins: - jekyll-redirect-from - jekyll-seo-tag + +exclude: ["deploy/includes"] diff --git a/docs/architecture/index.md b/docs/architecture/index.md deleted file mode 100644 index 77a06f1f8ea..00000000000 --- a/docs/architecture/index.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -layout: default -title: Architecture -description: This section contains information about how lakeFS is built, its different components and its data model -nav_order: 8 -has_children: true ---- diff --git a/docs/branching/index.md b/docs/branching/index.md deleted file mode 100644 index e347e43879c..00000000000 --- a/docs/branching/index.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -layout: default -title: Branching Model -description: At its core, lakeFS uses a Git-like branching model that scales to Petabytes of data by utilizing S3 or GCS for storage. -nav_order: 25 -has_children: true ---- diff --git a/docs/branching/recommendations.md b/docs/branching/recommendations.md deleted file mode 100644 index 258de39cd5b..00000000000 --- a/docs/branching/recommendations.md +++ /dev/null @@ -1,164 +0,0 @@ ---- -layout: default -title: Recommended Branching Models -description: The brnaching modles listed below are often used to help improve organization's data developmet and CI/CD -parent: Branching Model -has_children: false -nav_order: 2 ---- - -# Recommended Branching Models -{: .no_toc } - -## Table of contents -{: .no_toc .text-delta } - -1. TOC -{:toc} - -## Development Environment - -As part of our routine work with data we develop new code, improve and upgrade old code, upgrade infrastructures, and test new technologies. lakeFS enables a safe development environment on your data lake without the need to copy or mock data, work on the pipelines or involve DevOps. 
- -Creating a branch provides you an isolated environment with a snapshot of your repository (any part of your data lake you chose to manage on lakeFS). While working on your own branch in isolation, all other data users will be looking at the repository’s main branch. They can't see your changes, and you don’t see changes to main done after you created the branch. -No worries, no data duplication is done, it’s all metadata management behind the scenes. -Let’s look at 3 examples of a development environment and their branching models. - -### Example 1: Upgrading Spark and using Reset action - -You installed the latest version of Apache Spark. As a first step you’ll test your Spark jobs to see that the upgrade doesn't have any undesired side effects. - -For this purpose, you may create a branch (testing-spark-3.0) which will only be used to test the Spark upgrade, and discarded later. Jobs may run smoothly (the theoretical possibility exists!), or they may fail halfway through, leaving you with some intermediate partitions, data and metadata. In this case, you can simply *reset* the branch to its original state, without worrying about the intermediate results of your last experiment, and perform another (hopefully successful) test in an isolated branch. Reset actions are atomic and immediate, so no manual cleanup is required. - -Once testing is completed, and you have achieved the desired result, you can delete this experimental branch, and all data not used on any other branch will be deleted with it. - -branching_1 - -_Creating a testing branch:_ - - ```shell - lakectl branch create \ - lakefs://example-repo/testing-spark-3 \ - --source lakefs://example-repo/main - # output: - # created branch 'testing-spark-3', pointing to commit ID: '~79RU9aUsQ9GLnU' - ``` - -_Resetting changes to a branch:_ - - ```shell - lakectl branch reset lakefs://example-repo/testing-spark-3 - # are you sure you want to reset all uncommitted changes?: y█ - ``` - -**Note** lakeFS version <= v0.33.1 uses '@' (instead of '/') as separator between repository and branch. - -### Example 2: Compare - Which option is better? - -Easily compare by testing which one performs better on your data set. -Examples may be: -* Different computation tools, e.g Spark vs. Presto -* Different compression algorithms -* Different Spark configurations -* Different code versions of an ETL - -Run each experiment on its own independent branch, while the main remains untouched. Once both experiments are done, create a comparison query (using hive or presto or any other tool of your choice) to compare data characteristics, performance or any other metric you see fit. - -With lakeFS you don't need to worry about creating data paths for the experiments, copying data, and remembering to delete it. It’s substantially easier to avoid errors and maintain a clean lake after. - -branching_2 - -_Reading from and comparing branches using Spark:_ - - ```scala - val dfExperiment1 = sc.read.parquet("s3a://example-repo/experiment-1/events/by-date") - val dfExperiment2 = sc.read.parquet("s3a://example-repo/experiment-2/events/by-date") - - dfExperiment1.groupBy("...").count() - dfExperiment2.groupBy("...").count() // now we can compare the properties of the data itself - ``` - -### Example 3: Reproduce - A bug in production - -You upgraded spark and deployed changes in production. A few days or weeks later, you identify a data quality issue, a performance degradation, or an increase to your infra costs. 
Something that requires investigation and fixing (aka, a bug). - -lakeFS allows you to open a branch of your lake from the specific merge/commit that introduced the changes to production. Using the metadata saved on the merge/commit you can reproduce all aspects of the environment, then reproduce the issue on the branch and debug it. Meanwhile, you can revert the main to a previous point in time, or keep it as is, depending on the use case - -branching_3 - - -_Reading from a historic version (a previous commit) using Spark_ - - ```scala - // represents the data as existed at commit "~79RU9aUsQ9GLnU": - spark.read.parquet("s3://example-repo/~79RU9aUsQ9GLnU/events/by-date") - ``` - -## Continuous Integration -Everyday data lake management includes ingestion of new data collections, and a growing number of consumers reading and writing analysis results to the lake. In order to ensure our lake is reliable we need to validate new data sources, enforce good practices to maintain a clean lake (avoid the swamp) and validate metadata. lakeFS simplifies continuous integration of data to the lake by supporting ingestion on a designated branch. Merging data to main is enabled only if conditions apply. To make this tenable, let’s look at a few examples: - -### Example 1: Pre-merge hooks - enforce best practices - -Examples of good practices enforced in organizations: - - - No user_* columns except under /private/... - - Only `(*.parquet | *.orc | _delta_log/*.json)` files allowed - - Under /production, only backward-compatible schema changes are allowed - - New tables on main must be registered in our metadata repository first, with owner and SLA - -lakeFS will assist in enforcing best practices by giving you a designated branch to ingest new data (“new-data-1” in the drawing). . You may run automated tests to validate predefined best practices as pre-merge hooks. If the validation passes, the new data will be automatically and atomically merged to the main branch. However, if the validation fails, you will be alerted, and the new data will not be exposed to consumers. - -By using this branching model and implementing best practices as pre merge hooks, you ensure the main lake is never compromised. - -branching_4 - - -## Continuous Deployment -Not every day we introduce new data to the lake, or add/change ETLs, but we do have recurring jobs that are running, and updates to our existing data collections. Even if the code and infra didn't change, the data might, and those changes introduce quality issues. This is one of the complexities of a data product, the data we consume changes over the course of a month, a week, or even a single day. - -**Examples of changes to data that may occur:** - - A client-side bug in the data collection of website events - - A new Android version that interferes with the collecting events from your App - - COVID-19 abrupt impact on consumers' behavior, and its effect on the accuracy of ML models. - - During a change to Salesforce interface, the validation requirement from a certain field had been lost - -lakeFS helps you validate your expectations and assumptions from the data itself. - - -### Example 1: Pre merge hook - a data quality issue - -Continuous deployment of existing data we expect to consume, flowing from our ingest-pipelines into the lake. Similar to the Continuous Integration use-case - we create a ingest branch (“events-data”), which allows us to create tests using data analysis tools or data quality services (e.g. 
[Great Expectations](https://greatexpectations.io/){: target="_blank" }, [Monte Carlo](https://www.montecarlodata.com/){: target="_blank" }) to ensure reliability of the data we merge to the main branch. Since merge is atomic, no performance issue will be introduced by using lakeFS, but your main branch will only include quality data. - -branching_6 - -### Example 2: RollBack! - Data ingested from a Kafka stream - -If you introduce a new code version to production and discover it has a critical bug, you can simply roll back to the previous version. But you also need to roll back the results of running it. lakeFS gives you the power to rollback your data if you introduced low quality data. The rollback is an atomic action that prevents the data consumers from receiving low quality data until the issue is resolved. - -As previously mentioned, with lakeFS the recommended branching schema is to ingest data to a dedicated branch. When streaming data, we can decide to merge the incoming data to main at a given time interval or checkpoint, depending on how we chose to write it from Kafka. - -You can run quality tests for each merge (as presented in Example 1). Alas, tests are not perfect and we might still introduce low quality data at some point. In such a case, we can rollback main to the last known high quality commit, since our commits for streaming will include the metadata of the Kafka offset. - -branching_7 - -_Rolling back a branch to a previous commit using the CLI_ - - ```shell - lakectl branch reset lakefs://example-repo/stream-1 --commit ~79RU9aUsQ9GLnU - ``` - -**Note** lakeFS version <= v0.33.1 uses '@' (instead of '/') as separator between repository and branch. - -### Example 3: Cross collection consistency - -We often need consistency between different data collections. A few examples may be: - - To join different collections in order to create a unified view of an account, a user or another entity we measure. - - To introduce the same data in different formats - - To introduce the same data with a different leading index or sorting due to performance considerations - -lakeFS will help ensure you introduce only consistent data to your consumers by exposing the new collections and their join in one atomic action to main. Once you consumed the collections on a different branch, and only when both are synchronized, we calculated the join and merged to main. - -In this example you can see two data sets (Sales data and Marketing data) consumed each to its own independent branch, and after the write of both data sets is completed, they are merged to a different branch (leads branch) where the join ETL runs and creates a joined collection by account. The joined table is then merged to main. -The same logic can apply if the data is ingested in streaming, using standard formats, or formats that allow upsert/delete such as Apache Hudi, Delta Lake or Iceberg. - -branching_8 diff --git a/docs/community.md b/docs/community.md deleted file mode 100644 index 36d04e5324f..00000000000 --- a/docs/community.md +++ /dev/null @@ -1,23 +0,0 @@ ---- -layout: default -title: Community -description: lakeFS community. Join the community at the lakeFS slack workspace and feel free to ask questions and get help. -nav_order: 50 -has_children: false ---- - -# Community - -We're excited to hear from you! - -### Get in touch with the lakeFS team - -Join our public [Slack space](https://lakefs.io/slack). We’re extremely responsive and you can expect a fast reply. 
- -### Contribute - -Whether it’s a bug report, new feature, correction, or additional documentation, we greatly value feedback and contributions from our community. Our [Contributing Guide](https://docs.lakefs.io/contributing.html) is a great place to get started. - -### Ask a question - -If you have any questions or need technical support, feel free to message us in the [Slack](https://lakefs.io/slack) #help channel. diff --git a/docs/deploy/aws.md b/docs/deploy/aws.md new file mode 100644 index 00000000000..3d48cecee9e --- /dev/null +++ b/docs/deploy/aws.md @@ -0,0 +1,116 @@ +--- +layout: default +title: On AWS +parent: Deploy lakeFS +description: +nav_order: 10 +redirect_from: + - ../deploying-aws/index.html + - ../deploying-aws/install.html + - ../deploying-aws/db.html + - ../deploying-aws/lb_dns.html +--- + +# Deploy lakeFS on AWS +{: .no_toc } +Expected deployment time: 25min + +## Table of contents +{: .no_toc .text-delta } + +1. TOC +{:toc} + +{% include_relative includes/prerequisites.md %} + +## Creating the Database on AWS RDS +lakeFS requires a PostgreSQL database to synchronize actions on your repositories. +We will show you how to create a database on AWS RDS, but you can use any PostgreSQL database as long as it's accessible by your lakeFS installation. + +If you already have a database, take note of the connection string and skip to the [next step](#install-lakefs-on-ec2) + +1. Follow the official [AWS documentation](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_GettingStarted.CreatingConnecting.PostgreSQL.html){: target="_blank" } on how to create a PostgreSQL instance and connect to it. + You may use the default PostgreSQL engine, or [Aurora PostgreSQL](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.AuroraPostgreSQL.html){: target="_blank" }. Make sure you're using PostgreSQL version >= 11. +2. Once your RDS is set up and the server is in `Available` state, take note of the endpoint and port. + + ![RDS Connection String](../assets/img/rds_conn.png) + +3. Make sure your security group rules allow you to connect to the database instance. + +## Installation Options + +### On EC2 +1. Save the following configuration file as `config.yaml`: + + ```yaml + --- + database: + connection_string: "[DATABASE_CONNECTION_STRING]" + auth: + encrypt: + # replace this with a randomly-generated string: + secret_key: "[ENCRYPTION_SECRET_KEY]" + blockstore: + type: s3 + s3: + region: us-east-1 + gateways: + s3: + # replace this with the host you will use for the lakeFS S3-compatible endpoint: + domain_name: [S3_GATEWAY_DOMAIN] + ``` + +1. [Download the binary](../index.md#downloads) to the EC2 instance. +1. Run the `lakefs` binary on the EC2 instance: + ```bash + lakefs --config config.yaml run + ``` + **Note:** it is preferable to run the binary as a service using systemd or your operating system's facilities. + +### On ECS +To support container-based environments like AWS ECS, lakeFS can be configured using environment variables. Here is a `docker run` +command to demonstrate starting lakeFS using Docker: + +```sh +docker run \ + --name lakefs \ + -p 8000:8000 \ + -e LAKEFS_DATABASE_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \ + -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \ + -e LAKEFS_BLOCKSTORE_TYPE="s3" \ + -e LAKEFS_GATEWAYS_S3_DOMAIN_NAME="[S3_GATEWAY_DOMAIN]" \ + treeverse/lakefs:latest run +``` + +See the [reference](../reference/configuration.md#using-environment-variables) for a complete list of environment variables. 
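Whichever of the options above you choose, it helps to confirm the server came up before moving on. For the EC2 option, the note about running the binary as a service can be satisfied with a small systemd unit; the sketch below is only an assumption of one reasonable setup (the unit name, `lakefs` user, and file paths are not part of lakeFS or of this patch):

```sh
# Sketch only: run lakeFS as a systemd service on the EC2 instance.
# Assumes the binary was placed at /usr/local/bin/lakefs and the config at /etc/lakefs/config.yaml.
sudo tee /etc/systemd/system/lakefs.service > /dev/null <<'EOF'
[Unit]
Description=lakeFS server
After=network.target

[Service]
ExecStart=/usr/local/bin/lakefs --config /etc/lakefs/config.yaml run
Restart=on-failure
User=lakefs

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now lakefs

# lakeFS listens on port 8000 by default; the health endpoint should answer once the server is up:
curl -f http://localhost:8000/_health
```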
+ +### On EKS +See [Kubernetes Deployment](./k8s.md). + +## Load balancing +Depending on how you chose to install lakeFS, you should have a load balancer direct requests to the lakeFS server. +By default, lakeFS operates on port 8000, and exposes a `/_health` endpoint which you can use for health checks. + +### Notes for using an AWS Application Load Balancer +{: .no_toc } + +1. Your security groups should allow the load balancer to access the lakeFS server. +1. Create a target group with a listener for port 8000. +1. Setup TLS termination using the domain names you wish to use for both endpoints (e.g. `s3.lakefs.example.com`, `*.s3.lakefs.example.com`, `lakefs.example.com`). +1. Configure the health-check to use the exposed `/_health` URL + +## DNS on AWS Route53 +As mentioned above, you should create 3 DNS records for lakeFS: +1. One record for the lakeFS API: `lakefs.example.com` +1. Two records for the S3-compatible API: `s3.lakefs.example.com` and `*.s3.lakefs.example.com`. + +For an AWS load balancer with Route53 DNS, create a simple record, and choose *Alias to Application and Classic Load Balancer* with an `A` record type. + +![Configuring a simple record in Route53](../assets/img/route53.png) + +For other DNS providers, refer to the documentation on how to add CNAME records. + +## Next Steps +You can now move on to the [Setup](../guides/setup.md) page. + +{% include_relative includes/why-dns.md %} diff --git a/docs/deploy/azure.md b/docs/deploy/azure.md new file mode 100644 index 00000000000..bd8ce3e31d0 --- /dev/null +++ b/docs/deploy/azure.md @@ -0,0 +1,103 @@ +--- +layout: default +title: On Azure +parent: Deploy lakeFS +description: This guide will help you deploy your production lakeFS environment on Azure +nav_order: 20 +--- + +# Deploy lakeFS on Azure +{: .no_toc } +Expected deployment time: 25min + +## Table of contents +{: .no_toc .text-delta } + +1. TOC +{:toc} + +{% include_relative includes/prerequisites.md %} + +## Creating the Database on Azure Database +lakeFS requires a PostgreSQL database to synchronize actions on your repositories. +We will show you how to create a database on Azure Database, but you can use any PostgreSQL database as long as it's accessible by your lakeFS installation. + +If you already have a database, take note of the connection string and skip to the [next step](#install-lakefs-on-azure-vm) + +1. Follow the official [Azure documentation](https://docs.microsoft.com/en-us/azure/postgresql/quickstart-create-server-database-portal){: target="_blank" } on how to create a PostgreSQL instance and connect to it. + Make sure you're using PostgreSQL version >= 11. +1. Once your Azure Database for PostgreSQL server is set up and the server is in `Available` state, take note of the endpoint and username. + ![Azure postgres Connection String](../assets/img/azure_postgres_conn.png) +1. Make sure your Access control roles allow you to connect to the database instance. + +## Installation Options + +### On Azure VM +1. 
Save the following configuration file as `config.yaml`: + + ```yaml + --- + database: + connection_string: "[DATABASE_CONNECTION_STRING]" + auth: + encrypt: + # replace this with a randomly-generated string: + secret_key: "[ENCRYPTION_SECRET_KEY]" + blockstore: + type: azure + azure: + auth_method: msi # msi for active directory, access-key for access key + # In case you chose to authenticate via access key unmark the following rows and insert the values from the previous step + # storage_account: [your storage account] + # storage_access_key: [your access key] + gateways: + s3: + # replace this with the host you will use for the lakeFS S3-compatible endpoint: + domain_name: [S3_GATEWAY_DOMAIN] + ``` + +1. [Download the binary](../index.md#downloads) to the Azure Virtual Machine. +1. Run the `lakefs` binary on the machine: + ```bash + lakefs --config config.yaml run + ``` + **Note:** it is preferable to run the binary as a service using systemd or your operating system's facilities. +1. To support Azure AD authentication go to `Identity` tab and switch `Status` toggle to on, then add the `Storage Blob Data Contributor' role on the container you created. + +### On Azure Container instances +To support container-based environments like Azure Container Instances, lakeFS can be configured using environment variables. Here is a `docker run` +command to demonstrate starting lakeFS using Docker: + +```sh +docker run \ + --name lakefs \ + -p 8000:8000 \ + -e LAKEFS_DATABASE_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \ + -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \ + -e LAKEFS_BLOCKSTORE_TYPE="azure" \ + -e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCOUNT="[YOUR_STORAGE_ACCOUNT]" \ + -e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCESS_KEY="[YOUR_ACCESS_KEY]" \ + -e LAKEFS_GATEWAYS_S3_DOMAIN_NAME="[S3_GATEWAY_DOMAIN]" \ + treeverse/lakefs:latest run +``` + +See the [reference](../reference/configuration.md#using-environment-variables) for a complete list of environment variables. + +### On AKS +See [Kubernetes Deployment](./k8s.md). + +## Load balancing +Depending on how you chose to install lakeFS, you should have a load balancer direct requests to the lakeFS server. +By default, lakeFS operates on port 8000, and exposes a `/_health` endpoint which you can use for health checks. + +## DNS +As mentioned above, you should create 3 DNS records for lakeFS: +1. One record for the lakeFS API: `lakefs.example.com` +1. Two records for the S3-compatible API: `s3.lakefs.example.com` and `*.s3.lakefs.example.com`. + +Depending on your DNS provider, refer to the documentation on how to add CNAME records. + +## Next Steps +You can now move on to the [Setup](../guides/setup.md) page. + +{% include_relative includes/why-dns.md %} \ No newline at end of file diff --git a/docs/deploy/docker.md b/docs/deploy/docker.md new file mode 100644 index 00000000000..6d9c547745b --- /dev/null +++ b/docs/deploy/docker.md @@ -0,0 +1,70 @@ +--- +layout: default +title: On Docker +parent: Deploy lakeFS +description: This guide will help you deploy your production lakeFS environment with Docker. +nav_order: 50 +--- +# Deploy lakeFS on Docker +{: .no_toc } + +## Database +{: .no_toc } + +lakeFS requires a PostgreSQL database to synchronize actions on your repositories. +This section assumes you already have a PostgreSQL database accessible from where you intend to install lakeFS. 
+Instructions for creating the database can be found in the deployment instructions for [AWS](./aws.md#creating-the-database-on-aws-rds), [Azure](./azure.md#creating-the-database-on-azure-database) and [GCP](./gcp.md#creating-the-database-on-gcp-sql).
+
+## Table of contents
+{: .no_toc .text-delta }
+
+1. TOC
+{:toc}
+
+{% include_relative includes/prerequisites.md %}
+
+## Installing on Docker
+To deploy using Docker, create a YAML configuration file.
+Here is a minimal example; see the [reference](../reference/configuration.md#example-aws-deployment) for the full list of configuration options.
+ +
+**AWS**
+{% include_relative includes/aws-docker-config.md %}
+
+**Google Cloud**
+{% include_relative includes/gcp-docker-config.md %}
+
+**Microsoft Azure**
+{% include_relative includes/azure-docker-config.md %}
+ +Save the configuration file locally as `lakefs-config.yaml` and run the following command: + +```sh +docker run \ + --name lakefs \ + -p 8000:8000 \ + -v $(pwd)/lakefs-config.yaml:/home/lakefs/.lakefs.yaml \ + treeverse/lakefs:latest run +``` + +## Load balancing +You should have a load balancer direct requests to the lakeFS server. +By default, lakeFS operates on port 8000, and exposes a `/_health` endpoint which you can use for health checks. + +## DNS +As mentioned above, you should create 3 DNS records for lakeFS: +1. One record for the lakeFS API: `lakefs.example.com` +1. Two records for the S3-compatible API: `s3.lakefs.example.com` and `*.s3.lakefs.example.com`. + +All records should point to your Load Balancer, preferably with a short TTL value. + +## Next Steps +You can now move on to the [Setup](../guides/setup.md) page. + +{% include_relative includes/why-dns.md %} diff --git a/docs/deploy/gcp.md b/docs/deploy/gcp.md new file mode 100644 index 00000000000..c9d92c48462 --- /dev/null +++ b/docs/deploy/gcp.md @@ -0,0 +1,103 @@ +--- +layout: default +title: On GCP +parent: Deploy lakeFS +description: This guide will help you deploy your production lakeFS environment on GCP +nav_order: 30 +--- + +# Deploy lakeFS on GCP +{: .no_toc } +Expected deployment time: 25min + +## Table of contents +{: .no_toc .text-delta } + +1. TOC +{:toc} + +{% include_relative includes/prerequisites.md %} + +## Creating the Database on GCP SQL +lakeFS requires a PostgreSQL database to synchronize actions on your repositories. +We will show you how to create a database on Google Cloud SQL, but you can use any PostgreSQL database as long as it's accessible by your lakeFS installation. + +If you already have a database, take note of the connection string and skip to the [next step](#install-lakefs-on-ec2) + +1. Follow the official [Google documentation](https://cloud.google.com/sql/docs/postgres/quickstart#create-instance) on how to create a PostgreSQL instance. + Make sure you're using PostgreSQL version >= 11. +1. On the *Users* tab in the console, create a user to be used by the lakeFS installation. +1. Choose the method by which lakeFS [will connect to your database](https://cloud.google.com/sql/docs/postgres/connect-overview). Google recommends using + the [SQL Auth Proxy](https://cloud.google.com/sql/docs/postgres/sql-proxy). + +Depending on the chosen lakeFS installation method, you will need to make sure lakeFS can access your database. +For example, if you install lakeFS on GKE, you need to deploy the SQL Auth Proxy from [this Helm chart](https://github.com/rimusz/charts/blob/master/stable/gcloud-sqlproxy/README.md), or as [a sidecar container in your lakeFS pod](https://cloud.google.com/sql/docs/mysql/connect-kubernetes-engine). + +You can now proceed to [Configuring the Storage](bucket.md). + +## Installation Options + +### On Google Compute Engine +1. Save the following configuration file as `config.yaml`: + + ```yaml + --- + database: + connection_string: "[DATABASE_CONNECTION_STRING]" + auth: + encrypt: + # replace this with a randomly-generated string: + secret_key: "[ENCRYPTION_SECRET_KEY]" + blockstore: + type: gs + # Uncomment the following lines to give lakeFS access to your buckets using a service account: + # gs: + # credentials_json: [YOUR SERVICE ACCOUNT JSON STRING] + gateways: + s3: + # replace this with the host you will use for the lakeFS S3-compatible endpoint: + domain_name: [S3_GATEWAY_DOMAIN] + ``` + +1. [Download the binary](../index.md#downloads) to the GCE instance. +1. 
Run the `lakefs` binary on the GCE machine: + ```bash + lakefs --config config.yaml run + ``` + **Note:** it is preferable to run the binary as a service using systemd or your operating system's facilities. + +### On Google Cloud Run +To support container-based environments like Google Cloud Run, lakeFS can be configured using environment variables. Here is a `docker run` +command to demonstrate starting lakeFS using Docker: + +```sh +docker run \ + --name lakefs \ + -p 8000:8000 \ + -e LAKEFS_DATABASE_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \ + -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \ + -e LAKEFS_BLOCKSTORE_TYPE="gs" \ + -e LAKEFS_GATEWAYS_S3_DOMAIN_NAME="[S3_GATEWAY_DOMAIN]" \ + treeverse/lakefs:latest run +``` + +See the [reference](../reference/configuration.md#using-environment-variables) for a complete list of environment variables. + +### On GKE +See [Kubernetes Deployment](./k8s.md). + +## Load balancing +Depending on how you chose to install lakeFS, you should have a load balancer direct requests to the lakeFS server. +By default, lakeFS operates on port 8000, and exposes a `/_health` endpoint which you can use for health checks. + +## DNS +As mentioned above, you should create 3 DNS records for lakeFS: +1. One record for the lakeFS API: `lakefs.example.com` +1. Two records for the S3-compatible API: `s3.lakefs.example.com` and `*.s3.lakefs.example.com`. + +Depending on your DNS provider, refer to the documentation on how to add CNAME records. + +## Next Steps +You can now move on to the [Setup](../guides/setup.md) page. + +{% include_relative includes/why-dns.md %} \ No newline at end of file diff --git a/docs/deploying-aws/installation-methods/aws-docker-config.md b/docs/deploy/includes/aws-docker-config.md similarity index 100% rename from docs/deploying-aws/installation-methods/aws-docker-config.md rename to docs/deploy/includes/aws-docker-config.md diff --git a/docs/deploying-aws/installation-methods/aws-docker-run.md b/docs/deploy/includes/aws-docker-run.md similarity index 100% rename from docs/deploying-aws/installation-methods/aws-docker-run.md rename to docs/deploy/includes/aws-docker-run.md diff --git a/docs/deploying-aws/installation-methods/aws-helm-values.md b/docs/deploy/includes/aws-helm-values.md similarity index 100% rename from docs/deploying-aws/installation-methods/aws-helm-values.md rename to docs/deploy/includes/aws-helm-values.md diff --git a/docs/deploying-aws/installation-methods/azure-docker-config.md b/docs/deploy/includes/azure-docker-config.md similarity index 100% rename from docs/deploying-aws/installation-methods/azure-docker-config.md rename to docs/deploy/includes/azure-docker-config.md diff --git a/docs/deploying-aws/installation-methods/azure-docker-run.md b/docs/deploy/includes/azure-docker-run.md similarity index 100% rename from docs/deploying-aws/installation-methods/azure-docker-run.md rename to docs/deploy/includes/azure-docker-run.md diff --git a/docs/deploying-aws/installation-methods/azure-helm-values.md b/docs/deploy/includes/azure-helm-values.md similarity index 100% rename from docs/deploying-aws/installation-methods/azure-helm-values.md rename to docs/deploy/includes/azure-helm-values.md diff --git a/docs/deploying-aws/installation-methods/gcp-docker-config.md b/docs/deploy/includes/gcp-docker-config.md similarity index 100% rename from docs/deploying-aws/installation-methods/gcp-docker-config.md rename to docs/deploy/includes/gcp-docker-config.md diff --git 
a/docs/deploying-aws/installation-methods/gcp-docker-run.md b/docs/deploy/includes/gcp-docker-run.md similarity index 100% rename from docs/deploying-aws/installation-methods/gcp-docker-run.md rename to docs/deploy/includes/gcp-docker-run.md diff --git a/docs/deploying-aws/installation-methods/gcp-helm-values.md b/docs/deploy/includes/gcp-helm-values.md similarity index 100% rename from docs/deploying-aws/installation-methods/gcp-helm-values.md rename to docs/deploy/includes/gcp-helm-values.md diff --git a/docs/deploy/includes/prerequisites.md b/docs/deploy/includes/prerequisites.md new file mode 100644 index 00000000000..c52748acef8 --- /dev/null +++ b/docs/deploy/includes/prerequisites.md @@ -0,0 +1,10 @@ +## Prerequisites + +{: .no_toc } +A production-suitable lakeFS installation will require three DNS records **pointing at your lakeFS server**. +A good convention for those will be, assuming you already own the domain `example.com`: +* `lakefs.example.com` +* `s3.lakefs.example.com` - **this is the S3 Gateway Domain** +* `*.s3.lakefs.example.com` + +The second record, the *S3 Gateway Domain*, is used in lakeFS configuration to differentiate between the S3 Gateway API and the OpenAPI Server. For more info, see [Why do I need these three DNS records?](#why-do-i-need-the-three-dns-records) diff --git a/docs/deploy/includes/why-dns.md b/docs/deploy/includes/why-dns.md new file mode 100644 index 00000000000..f8bb8bf2bb3 --- /dev/null +++ b/docs/deploy/includes/why-dns.md @@ -0,0 +1,12 @@ +## Why do I need the three DNS records? +{: .no_toc } + +Multiple DNS records are needed to access the two different lakeFS APIs (covered in more detail in the [Architecture](../architecture/overview.md) section): + +1. **The lakeFS OpenAPI**: used by the `lakectl` CLI tool. Exposes git-like operations (branching, diffing, merging etc.). +1. **An S3-compatible API**: read and write your data in any tool that can communicate with S3. Examples include: AWS CLI, Boto, Presto and Spark. + +lakeFS actually exposes only one API endpoint. For every request, lakeFS checks the `Host` header. +If the header is under the S3 gateway domain, the request is directed to the S3-compatible API. + +The third DNS record (`*.s3.lakefs.example.com`) allows for [virtual-host style access](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html). This is a way for AWS clients to specify the bucket name in the Host subdomain. diff --git a/docs/deploy/index.md b/docs/deploy/index.md new file mode 100644 index 00000000000..87500f19485 --- /dev/null +++ b/docs/deploy/index.md @@ -0,0 +1,13 @@ +--- +layout: default +title: Deploy lakeFS +description: This section will guide you through deploying a production-suitable lakeFS environment. +nav_order: 10 +has_children: true +--- + +# Deploy lakeFS + +This page contains a collection of practical step-by-step instructions to help you set up lakeFS on your preferred cloud environemnt. +If you just want to try out lakeFS locally, see [Quickstart](../quickstart/index.md). 
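Once the DNS records described in the prerequisites and `why-dns` includes above are in place, they can be sanity-checked from any machine with `dig` and `curl`. This is only a sketch: the `example.com` names are the same placeholders used above, and the exact responses depend on your load balancer and TLS setup.

```sh
# All three names should resolve to the lakeFS server / load balancer:
dig +short lakefs.example.com
dig +short s3.lakefs.example.com
dig +short some-repo.s3.lakefs.example.com   # covered by the wildcard record

# The lakeFS OpenAPI host serves the health endpoint:
curl -i https://lakefs.example.com/_health

# A request whose Host header falls under the S3 gateway domain is routed to the
# S3-compatible API, so it should answer in S3 style (an XML error without credentials):
curl -i https://s3.lakefs.example.com/
```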
+
diff --git a/docs/deploy/k8s.md b/docs/deploy/k8s.md
new file mode 100644
index 00000000000..3d0053816c3
--- /dev/null
+++ b/docs/deploy/k8s.md
@@ -0,0 +1,85 @@
+---
+layout: default
+title: On Kubernetes
+parent: Deploy lakeFS
+description: This guide will help you deploy your production lakeFS environment on Kubernetes using a Helm chart
+nav_order: 40
+---
+
+
+# Deploy lakeFS on Kubernetes
+{: .no_toc }
+
+## Database
+{: .no_toc }
+
+lakeFS requires a PostgreSQL database to synchronize actions on your repositories.
+This section assumes you already have a PostgreSQL database accessible from your Kubernetes cluster.
+Instructions for creating the database can be found in the deployment instructions for [AWS](./aws.md#creating-the-database-on-aws-rds), [Azure](./azure.md#creating-the-database-on-azure-database) and [GCP](./gcp.md#creating-the-database-on-gcp-sql).
+
+## Table of contents
+{: .no_toc .text-delta }
+
+1. TOC
+{:toc}
+
+{% include_relative includes/prerequisites.md %}
+
+## Installing on Kubernetes
+
+lakeFS can be easily installed on Kubernetes using a [Helm chart](https://github.com/treeverse/charts/tree/master/charts/lakefs).
+To install lakeFS with Helm:
+1. Copy the Helm values file relevant to your storage provider:
+ +
+**AWS**
+{% include_relative includes/aws-helm-values.md %}
+
+**Google Cloud**
+{% include_relative includes/gcp-helm-values.md %}
+
+**Microsoft Azure**
+{% include_relative includes/azure-helm-values.md %}
+ +1. Fill in the missing values and save the file as `conf-values.yaml`. For more configuration options, see our Helm chart [README](https://github.com/treeverse/charts/blob/master/charts/lakefs/README.md#custom-configuration){:target="_blank"}. + + The `lakefsConfig` parameter is the lakeFS configuration documented [here](https://docs.lakefs.io/reference/configuration.html), but without sensitive information. + Sensitive information like `databaseConnectionString` is given through separate parameters, and the chart will inject them into Kubernetes secrets. + +1. In the directory where you created `conf-values.yaml`, run the following commands: + + ```bash + # Add the lakeFS repository + helm repo add lakefs https://charts.lakefs.io + # Deploy lakeFS + helm install example-lakefs lakefs/lakefs -f conf-values.yaml + ``` + + *example-lakefs* is the [Helm Release](https://helm.sh/docs/intro/using_helm/#three-big-concepts) name. + +You should give your Kubernetes nodes access to all buckets/containers you intend to use lakeFS with. +If you can't provide such access, lakeFS can be configured to use an AWS key-pair, an Azure access key, or a Google Cloud credentials file to authenticate (part of the `lakefsConfig` YAML below). +{: .note .note-info } + +## Load balancing +You should have a load balancer direct requests to the lakeFS server. +Options to do so include a Kubernetes Service of type `LoadBalancer`, or a Kubernetes Ingress. +By default, lakeFS operates on port 8000, and exposes a `/_health` endpoint which you can use for health checks. + +## DNS +As mentioned above, you should create 3 DNS records for lakeFS: +1. One record for the lakeFS API: `lakefs.example.com` +1. Two records for the S3-compatible API: `s3.lakefs.example.com` and `*.s3.lakefs.example.com`. + +All records should point to your Load Balancer, preferably with a short TTL value. + +## Next Steps +You can now move on to the [Setup](../guides/setup.md) page. + +{% include_relative includes/why-dns.md %} diff --git a/docs/deploying-aws/bucket.md b/docs/deploying-aws/bucket.md deleted file mode 100644 index b1372060fc1..00000000000 --- a/docs/deploying-aws/bucket.md +++ /dev/null @@ -1,102 +0,0 @@ ---- -layout: default -title: Configuring the Storage -description: Providing the data storage layer for our installation. -parent: Production Deployment -nav_order: 15 -has_children: false ---- - -# Configuring the Storage -{: .no_toc } - -A production installation of lakeFS will usually use your cloud provider's object storage as the underlying storage layer. -You can choose to create a new bucket/container (recommended), or use an existing one with a path prefix. The path under the existing bucket/container should be empty. - -After you have a bucket/container configured, proceed to [Installing lakeFS](./install.md). - -Choose your cloud provider to configure your storage. - -## Table of contents -{: .no_toc .text-delta } - -1. TOC -{:toc} - -## AWS S3 - -1. From the S3 Administration console, choose `Create Bucket`. -2. Make sure you: - 1. Block public access - 2. Disable Object Locking -3. Go to the `Permissions` tab, and create a Bucket Policy. 
Use the following structure: - - ```json - { - "Id": "Policy1590051531320", - "Version": "2012-10-17", - "Statement": [ - { - "Sid": "Stmt1590051522178", - "Action": [ - "s3:GetObject", - "s3:GetObjectVersion", - "s3:PutObject", - "s3:AbortMultipartUpload", - "s3:ListMultipartUploadParts", - "s3:GetBucketVersioning", - "s3:ListBucket", - "s3:GetBucketLocation", - "s3:ListBucketMultipartUploads", - "s3:ListBucketVersions" - ], - "Effect": "Allow", - "Resource": ["arn:aws:s3:::", "arn:aws:s3:::/*"], - "Principal": { - "AWS": ["arn:aws:iam:::role/"] - } - } - ] - } - ``` - - Replace ``, `` and `` with values relevant to your environment. - `IAM_ROLE` should be the role assumed by your lakeFS installation. - - Alternatively, if you use an AWS user's key-pair to authenticate lakeFS to AWS, change the policy's Principal to be the user: - - ```json - "Principal": { - "AWS": ["arn:aws:iam:::user/"] - } - ``` -You can now proceed to [Installing lakeFS](./install.md). - -## Microsoft Azure Blob Storage - -[Create a container in Azure portal](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal#create-a-container) -1. From the Azure portal, Storage Accounts, choose your account, then in the container tab click `+ Container`. -2. Make sure you block public access - -### Authenticate with Secret Key -{: .no_toc } - -In case you want to use the secret key for authentication you will need to use the account key in the configuration -Go to the `Access Keys` tab and click on `Show Keys` save the values under `Storage account name` and `Key` we will need them in the [installing lakeFS](install.md) step -### Authenticate with Active Directory -{: .no_toc } - -In case you want your lakeFS Installation (we will install in the next step) to access this Container using Active Directory authentication, -First go to the container you created in step 1. -* Go to `Access Control (IAM)` -* Go to the `Role assignments` tab -* Add the `Storage Blob Data Contributor` role to the Installation running lakeFS. - -You can now proceed to [Installing lakeFS](./install.md). - -## Google Cloud Storage -1. On the Google Cloud Storage console, click *Create Bucket*. Follow the instructions. - -1. On the *Permissions* tab, add the service account you intend to use lakeFS with. Give it a role that allows reading and writing to the bucket, e.g. *Storage Object Creator*. - -You can now proceed to [Installing lakeFS](./install.md). \ No newline at end of file diff --git a/docs/deploying-aws/db.md b/docs/deploying-aws/db.md deleted file mode 100644 index 307574675d7..00000000000 --- a/docs/deploying-aws/db.md +++ /dev/null @@ -1,58 +0,0 @@ ---- -layout: default -title: Creating the database -description: Creating the database. Before installing lakeFS, you need to have a PostgreSQL database. -parent: Production Deployment -nav_order: 10 -has_children: false ---- - -# Creating the database -{: .no_toc } - -lakeFS requires a PostgreSQL database to synchronize actions on your repositories. -We will show you how to create a database on your cloud platform. -You can use any PostgreSQL database as long as it's accessible by your lakeFS installation. - -If you already have a database, take note of the connection string and proceed to [Configuring the Storage](bucket.md). - -## Table of contents -{: .no_toc .text-delta } - -1. TOC -{:toc} - -## On AWS RDS - -1. 
Follow the official [AWS documentation](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_GettingStarted.CreatingConnecting.PostgreSQL.html){: target="_blank" } on how to create a PostgreSQL instance and connect to it. -You may use the default PostgreSQL engine, or [Aurora PostgreSQL](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.AuroraPostgreSQL.html){: target="_blank" }. Make sure you're using PostgreSQL version >= 11. -2. Once your RDS is set up and the server is in `Available` state, take note of the endpoint and port. - - ![RDS Connection String](../assets/img/rds_conn.png) - -3. Make sure your security group rules allow you to connect to the database instance. - -You can now proceed to [Configuring the Storage](bucket.md). - -## On Microsoft Azure Database - -1. Follow the official [Azure documentation](https://docs.microsoft.com/en-us/azure/postgresql/quickstart-create-server-database-portal){: target="_blank" } on how to create a PostgreSQL instance and connect to it. - Make sure you're using PostgreSQL version >= 11. -1. Once your Azure Database for PostgreSQL server is set up and the server is in `Available` state, take note of the endpoint and username. - ![Azure postgres Connection String](../assets/img/azure_postgres_conn.png) -1. Make sure your Access control roles allow you to connect to the database instance. - -You can now proceed to [Configuring the Storage](bucket.md). - -## On Google Cloud SQL - -1. Follow the official [Google documentation](https://cloud.google.com/sql/docs/postgres/quickstart#create-instance) on how to create a PostgreSQL instance. - Make sure you're using PostgreSQL version >= 11. -1. On the *Users* tab in the console, create a user to be used by the lakeFS installation. -1. Choose the method by which lakeFS [will connect to your database](https://cloud.google.com/sql/docs/postgres/connect-overview). Google recommends using - the [SQL Auth Proxy](https://cloud.google.com/sql/docs/postgres/sql-proxy). - -Depending on the chosen lakeFS installation method, you will need to make sure lakeFS can access your database. -For example, if you install lakeFS on GKE, you need to deploy the SQL Auth Proxy from [this Helm chart](https://github.com/rimusz/charts/blob/master/stable/gcloud-sqlproxy/README.md), or as [a sidecar container in your lakeFS pod](https://cloud.google.com/sql/docs/mysql/connect-kubernetes-engine). - -You can now proceed to [Configuring the Storage](bucket.md). diff --git a/docs/deploying-aws/index.md b/docs/deploying-aws/index.md deleted file mode 100644 index 07f28c5ac2d..00000000000 --- a/docs/deploying-aws/index.md +++ /dev/null @@ -1,11 +0,0 @@ ---- -layout: default -title: Production Deployment -description: This section will guide you through deploying a production-suitable lakeFS environment. -nav_order: 10 -has_children: true ---- - -This section will guide you through setting up lakeFS on your cloud provider. The first step is [Creating the Database](db.md). - -If you just want to try out lakeFS locally, see [Quick Start](../quickstart/index.md). diff --git a/docs/deploying-aws/install.md b/docs/deploying-aws/install.md deleted file mode 100644 index 76c50f6f596..00000000000 --- a/docs/deploying-aws/install.md +++ /dev/null @@ -1,159 +0,0 @@ ---- -layout: default -title: Installing lakeFS -description: Installing lakeFS is easy. This section covers common deployment options for installing lakeFS. 
-parent: Production Deployment -nav_order: 20 -has_children: false ---- - -# Installing lakeFS -{: .no_toc } - -For production deployments, install the lakeFS binary on your host of choice. - -## Preqrequisites -{: .no_toc } -A production-suitable lakeFS installation will require three DNS records **pointing at your lakeFS server**. -A good convention for those will be, assuming you already own the domain `example.com`: - - * `lakefs.example.com` - * `s3.lakefs.example.com` - **this is the S3 Gateway Domain** - * `*.s3.lakefs.example.com` - - -The second record, the *S3 Gateway Domain*, is used in lakeFS configuration to differentiate between the S3 Gateway API and the OpenAPI Server. For more info, see [Why do I need these three DNS records?](#why-do-i-need-the-three-dns-records) - -Find your preferred installation method: - -1. TOC -{:toc} - -## Kubernetes with Helm - -lakeFS can be easily installed on Kubernetes using a [Helm chart](https://github.com/treeverse/charts/tree/master/charts/lakefs). -To install lakeFS with Helm: -1. Copy the Helm values file relevant to your cloud provider: -
- -
-{% include_relative installation-methods/aws-helm-values.md %}
-{% include_relative installation-methods/gcp-helm-values.md %}
-{% include_relative installation-methods/azure-helm-values.md %}
- -1. Fill in the missing values and save the file as `conf-values.yaml`. For more configuration options, see our Helm chart [README](https://github.com/treeverse/charts/blob/master/charts/lakefs/README.md#custom-configuration){:target="_blank"}. - - The `lakefsConfig` parameter is the lakeFS configuration documented [here](https://docs.lakefs.io/reference/configuration.html), but without sensitive information. - Sensitive information like `databaseConnectionString` is given through separate parameters, and the chart will inject them into Kubernetes secrets. - -1. In the directory where you created `conf-values.yaml`, run the following commands: - - ```bash - # Add the lakeFS repository - helm repo add lakefs https://charts.lakefs.io - # Deploy lakeFS - helm install example-lakefs lakefs/lakefs -f conf-values.yaml - ``` - - *example-lakefs* is the [Helm Release](https://helm.sh/docs/intro/using_helm/#three-big-concepts) name. - -You should give your Kubernetes nodes access to all buckets/containers you intend to use lakeFS with. -If you can't provide such access, lakeFS can be configured to use an AWS key-pair, an Azure access key, or a Google Cloud credentials file to authenticate (part of the `lakefsConfig` YAML below). -{: .note .note-info } - -Once your installation is running, move on to [Load Balancing and DNS](./lb_dns.md). - -## Docker -To deploy using Docker, create a yaml configuration file. -Here is a minimal example, but you can see the [reference](../reference/configuration.md#example-aws-deployment) for the full list of configurations. -
- -
-{% include_relative installation-methods/aws-docker-config.md %} -
-
-{% include_relative installation-methods/gcp-docker-config.md %} -
-
-{% include_relative installation-methods/azure-docker-config.md %} -
-
- -Save the configuration file locally as `lakefs-config.yaml` and run the following command: - -```sh -docker run \ - --name lakefs \ - -p 8000:8000 \ - -v $(pwd)/lakefs-config.yaml:/home/lakefs/.lakefs.yaml \ - treeverse/lakefs:latest run -``` - -Once your installation is running, move on to [Load Balancing and DNS](./lb_dns.md). - -## AWS ECS / Google Cloud Run / Azure Container Instances - -Some environments make it harder to use a configuration file, and are best configured using environment variables. -All lakeFS configurations can be given through environment variables, see the [reference](../reference/configuration.md#using-environment-variables) for the full list of configurations. - -These configurations can be used to run lakeFS on container orchestration service providers like AWS ECS, Google Cloud Run , or Azure Container Instances. -Here is a `docker run` command to demonstrate the use of environment variables: - -
-{% include_relative installation-methods/aws-docker-run.md %}
-{% include_relative installation-methods/gcp-docker-run.md %}
-{% include_relative installation-methods/azure-docker-run.md %}
- -Once your installation is running, move on to [Load Balancing and DNS](./lb_dns.md). - -## AWS EC2 / Google Compute Engine / Azure Virtual Machine -Run lakeFS directly on a cloud instance: - -1. [Download the binary for your operating system](../downloads.md) -2. `lakefs` is a single binary, you can run it directly, but preferably run it as a service using systemd or your operating system's facilities. - - ```bash - lakefs --config run - ``` -3. To support azure AD authentication go to `Identity` tab and switch `Status` toggle to on, then add the `Storage Blob Data Contributer' role on the container you created. - -Once your installation is running, move on to [Load Balancing and DNS](./lb_dns.md). - -## Why do I need the three DNS records? -{: .no_toc } - -Multiple DNS records are needed to access the two different lakeFS APIs (covered in more detail in the [Architecture](../architecture/overview.md) section): - -1. **The lakeFS OpenAPI**: used by the `lakectl` CLI tool. Exposes git-like operations (branching, diffing, merging etc.). -1. **An S3-compatible API**: read and write your data in any tool that can communicate with S3. Examples include: AWS CLI, Boto, Presto and Spark. - -lakeFS actually exposes only one API endpoint. For every request, lakeFS checks the `Host` header. -If the header is under the S3 gateway domain, the request is directed to the S3-compatible API. - -The third DNS record (`*.s3.lakefs.example.com`) allows for [virtual-host style access](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html). This is a way for AWS clients to specify the bucket name in the Host subdomain. diff --git a/docs/deploying-aws/lb_dns.md b/docs/deploying-aws/lb_dns.md deleted file mode 100644 index 886ee8b7f7b..00000000000 --- a/docs/deploying-aws/lb_dns.md +++ /dev/null @@ -1,41 +0,0 @@ ---- -layout: default -title: Load Balancing and DNS -description: Depending on how you chose to install lakeFS, you should have a load balancer direct requests to the lakeFS server. -parent: Production Deployment -nav_order: 25 -has_children: false ---- -# Load Balancing and DNS - -This page covers how to point your Load Balancer to lakeFS, and how to set the DNS records. -If you already have those, move on to the [Setup](setup.md) page. - -## Load balancing -Depending on how you chose to install lakeFS, you should have a load balancer direct requests to the lakeFS server. -By default, lakeFS operates on port 8000, and exposes a `/_health` endpoint which you can use for health checks. - -### Notes for using an AWS Application Load Balancer - -1. Your security groups should allow the load balancer to access the lakeFS server. -1. Create a target group with a listener for port 8000. -1. Setup TLS termination using the domain names you wish to use for both endpoints (e.g. `s3.lakefs.example.com`, `*.s3.lakefs.example.com`, `lakefs.example.com`). -1. Configure the health-check to use the exposed `/_health` URL - -## DNS - -As mentioned in a previous step, you should create 3 DNS records for lakeFS: -1. One record for the lakeFS API: `lakefs.example.com` -1. Two records for the S3-compatible API: `s3.lakefs.example.com` and `*.s3.lakefs.example.com`. - -All records should point to your Load Balancer, preferably with a short TTL value. - -### In AWS Route53 -For an AWS load balancer with Route53 DNS, create a simple record, and choose *Alias to Application and Classic Load Balancer* with an `A` record type. 
- -![Configuring a simple record in Route53](../assets/img/route53.png) - -For other DNS providers, refer to the documentation on how to add CNAME records. - -You can now move on to the [Setup](setup.md) page. -c \ No newline at end of file diff --git a/docs/downloads.md b/docs/downloads.md deleted file mode 100644 index c07fcdac5ad..00000000000 --- a/docs/downloads.md +++ /dev/null @@ -1,25 +0,0 @@ ---- -layout: default -title: Downloads -description: This section provides information on downloads of binary packeges and official docker images. -nav_order: 20 -has_children: false ---- - -# Downloads -{: .no_toc } - -## Table of contents -{: .no_toc .text-delta } - -1. TOC -{:toc} - - -## Binary Releases - -Binary packages are available for Linux/macOS/Windows on [GitHub Releases](https://github.com/treeverse/lakeFS/releases){: target="_blank" } - -## Docker Images - -Official Docker images are available at [https://hub.docker.com/r/treeverse/lakefs](https://hub.docker.com/r/treeverse/lakefs){: target="_blank" } diff --git a/docs/branching/model.md b/docs/guides/branching-model.md similarity index 93% rename from docs/branching/model.md rename to docs/guides/branching-model.md index b8cc56299ae..178dc882edc 100644 --- a/docs/branching/model.md +++ b/docs/guides/branching-model.md @@ -1,10 +1,13 @@ --- layout: default -title: Introduction -description: When creating a new branch in lakeFS, we are actually creating a consistent snapshot of the entire repository -parent: Branching Model +title: Branching Model +description: This page explains how lakeFS uses a Git-like branching model at its core. +parent: Guides has_children: false -nav_order: 1 +nav_order: 3 +redirect_from: + - ../branching/ + - ../branching/model.html --- # Branching Model diff --git a/docs/hooks.md b/docs/guides/hooks.md similarity index 98% rename from docs/hooks.md rename to docs/guides/hooks.md index c0afe9120d8..1061edb2910 100644 --- a/docs/hooks.md +++ b/docs/guides/hooks.md @@ -1,8 +1,11 @@ --- layout: default title: Hooks -nav_order: 40 +parent: Guides +description: lakeFS allows the configuration of hooks to trigger when predefined events occur +nav_order: 2 has_children: false +redirect_from: ../hooks.html --- # Configurable Hooks diff --git a/docs/reference/import-mvcc.md b/docs/guides/import-mvcc.md similarity index 95% rename from docs/reference/import-mvcc.md rename to docs/guides/import-mvcc.md index 7db791d4f08..0ec72a888c8 100644 --- a/docs/reference/import-mvcc.md +++ b/docs/guides/import-mvcc.md @@ -19,8 +19,8 @@ has_children: false ## Copying using external tools -In order to import existing data to lakeFS, you may choose to copy it using [S3 CLI](../using/aws_cli.md#copy-from-a-local-path-to-lakefs) -or using tools like [Apache DistCp](../using/distcp.md#from-s3-to-lakefs). This is the most straightforward way, and we recommend it if it’s applicable for you. +In order to import existing data to lakeFS, you may choose to copy it using [S3 CLI](../integrations/aws_cli.md#copy-from-a-local-path-to-lakefs) +or using tools like [Apache DistCp](../integrations/distcp.md#from-s3-to-lakefs). This is the most straightforward way, and we recommend it if it’s applicable for you. 
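As a point of reference, here is a minimal sketch of such a copy using the AWS CLI against the lakeFS S3 gateway. The endpoint URL (`s3.lakefs.example.com`), repository (`example-repo`), branch (`main`) and local path are illustrative placeholders, and it assumes the AWS CLI profile in use is configured with a lakeFS key-pair rather than AWS credentials:

```shell
# Copy a local dataset into a lakeFS branch through the S3 gateway.
# The endpoint, repository and branch names below are placeholders;
# the CLI profile must be configured with lakeFS access credentials.
aws s3 cp --recursive \
  --endpoint-url https://s3.lakefs.example.com \
  ./existing-data/collections/events/ \
  s3://example-repo/main/collections/events/
```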
## Limitations diff --git a/docs/reference/import.md b/docs/guides/import.md similarity index 96% rename from docs/reference/import.md rename to docs/guides/import.md index a9602de8b69..ccc633188d1 100644 --- a/docs/reference/import.md +++ b/docs/guides/import.md @@ -1,15 +1,16 @@ --- layout: default -title: Importing data from existing Object Store +title: Importing data into lakeFS description: In order to import existing data to lakeFS, you may choose to copy it using S3 CLI or using tools like Apache DistCp. -parent: Reference +parent: Guides nav_order: 8 has_children: false +redirect_from: ../reference/import.html --- This page describes importing from versions >= v0.24.0. For ealier versions, see [mvcc import](import-mvcc.md) {: .note .pb-3 } -# Importing data from existing Object Store +# Importing data into lakeFS {: .no_toc } ## Table of contents @@ -20,8 +21,8 @@ This page describes importing from versions >= v0.24.0. For ealier versions, see ## Copying using external tools -In order to import existing data to lakeFS, you may choose to copy it using [S3 CLI](../using/aws_cli.md#copy-from-a-local-path-to-lakefs) -or using tools like [Apache DistCp](../using/distcp.md#from-s3-to-lakefs). This is the most straightforward way, and we recommend it if it’s applicable for you. +In order to import existing data to lakeFS, you may choose to copy it using [S3 CLI](../integrations/aws_cli.md#copy-from-a-local-path-to-lakefs) +or using tools like [Apache DistCp](../integrations/distcp.md#from-s3-to-lakefs). This is the most straightforward way, and we recommend it if it’s applicable for you. ## Limitations Unfortunately, copying data is not always feasible for the following reasons: diff --git a/docs/guides/index.md b/docs/guides/index.md new file mode 100644 index 00000000000..8930473aced --- /dev/null +++ b/docs/guides/index.md @@ -0,0 +1,7 @@ +--- +layout: default +title: Guides +description: +nav_order: 25 +has_children: true +--- diff --git a/docs/deploying-aws/setup.md b/docs/guides/setup.md similarity index 71% rename from docs/deploying-aws/setup.md rename to docs/guides/setup.md index 7480ccedc93..1b482ad733f 100644 --- a/docs/deploying-aws/setup.md +++ b/docs/guides/setup.md @@ -2,9 +2,11 @@ layout: default title: Setup description: This section outlines how to setup your environment once lakeFS is configured and running -parent: Production Deployment -nav_order: 27 +parent: Guides +nav_order: 1 has_children: false +redirect_from: + - ../deploying-aws/setup.html --- # Setup @@ -24,12 +26,12 @@ Once we have lakeFS configured and running, open `https:// [flags] -h, --help help for stage --location string fully qualified storage location (i.e. "s3://bucket/path/to/object") --meta strings key value pairs in the form of key=value - --mtime int Object modified time (Unix Epoch in seconds). Defaults to current time. + --mtime int Object modified time (Unix Epoch in seconds). 
Defaults to current time --size int Object size in bytes ``` @@ -1937,7 +1937,8 @@ lakectl tag create [flags] #### Options ``` - -h, --help help for create + -f, --force override the tag if it exists + -h, --help help for create ``` diff --git a/docs/deploying-aws/monitor.md b/docs/reference/monitor.md similarity index 90% rename from docs/deploying-aws/monitor.md rename to docs/reference/monitor.md index 210b9b947d0..17c0dce94ea 100644 --- a/docs/deploying-aws/monitor.md +++ b/docs/reference/monitor.md @@ -2,9 +2,10 @@ layout: default title: Monitoring using Prometheus description: Users looking to monitor their lakeFS instances can point Prometheus configuration to scrape data from this endpoint. This guide explains how to setup -parent: Production Deployment +parent: Reference nav_order: 30 has_children: false +redirect_from: ../deploying-aws/monitor.md --- # Monitoring using Prometheus @@ -39,9 +40,9 @@ You can learn about these default metrics in this [post](https://povilasv.me/pro In addition, lakeFS exposes the following metrics to help monitor your deployment: | Name in Prometheus | Description | Labels -| api_requests_total | [lakeFS API](../reference/api.md) requests (counter)| **code**: http status
**method**: http method
+| api_requests_total | [lakeFS API](api.md) requests (counter)| **code**: http status <br/>**method**: http method
 | api_request_duration_seconds | Durations of lakeFS API requests (histogram)| **operation**: name of API operation <br/>**code**: http status
-| gateway_request_duration_seconds | lakeFS [S3-compatible endpoint](../reference/s3.md) request (histogram)| **operation**: name of gateway operation <br/>**code**: http status
+| gateway_request_duration_seconds | lakeFS [S3-compatible endpoint](s3.md) request (histogram)| **operation**: name of gateway operation <br/>**code**: http status
 | s3_operation_duration_seconds | Outgoing S3 operations (histogram)| **operation**: operation name <br/>**error**: "true" if error, "false" otherwise
 | gs_operation_duration_seconds | Outgoing Google Storage operations (histogram)| **operation**: operation name <br/>**error**: "true" if error, "false" otherwise
 | azure_operation_duration_seconds | Outgoing Azure storage operations (histogram)| **operation**: operation name
**error**: "true" if error, "false" otherwise diff --git a/docs/deploying-aws/offboarding.md b/docs/reference/offboarding.md similarity index 83% rename from docs/deploying-aws/offboarding.md rename to docs/reference/offboarding.md index 6e3a4651b4c..52d4ee61ec8 100644 --- a/docs/deploying-aws/offboarding.md +++ b/docs/reference/offboarding.md @@ -2,9 +2,10 @@ layout: default title: Migrating away from lakeFS description: The simplest way to migrate away from lakeFS is to copy data from a lakeFS repository to an S3 bucket -parent: Production Deployment +parent: Reference nav_order: 40 has_children: false +redirect_from: ../deploying-aws/offboarding.html --- # Migrating away from lakeFS @@ -14,6 +15,6 @@ has_children: false The simplest way to migrate away from lakeFS is to copy data from a lakeFS repository to an S3 bucket (or any other object store). -For smaller repositories, this could be done using the [AWS cli](../using/aws_cli.md) or [rclone](../using/rclone.md). +For smaller repositories, this could be done using the [AWS cli](../integrations/aws_cli.md) or [rclone](../integrations/rclone.md). For larger repositories, running [distcp](https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html){: target="_blank"} with lakeFS as the source is also an option. diff --git a/docs/deploying-aws/upgrade.md b/docs/reference/upgrade.md similarity index 97% rename from docs/deploying-aws/upgrade.md rename to docs/reference/upgrade.md index 0896ac0375b..2aacd859b22 100644 --- a/docs/deploying-aws/upgrade.md +++ b/docs/reference/upgrade.md @@ -2,9 +2,10 @@ layout: default title: Upgrade lakeFS description: Upgrading lakeFS from a previous version usually just requires re-deploying with the latest image or downloading the latest version -parent: Production Deployment +parent: Reference nav_order: 50 has_children: false +redirect_from: ../deploying-aws/upgrade.html --- # Upgrading lakeFS diff --git a/docs/storage/blob.md b/docs/storage/blob.md new file mode 100644 index 00000000000..a5bb7692028 --- /dev/null +++ b/docs/storage/blob.md @@ -0,0 +1,30 @@ +--- +layout: default +title: Azure Blob Storage +description: This guide explains how to configure Azure Blob Storage as the underlying storage layer. +parent: Prepare Your Storage +nav_order: 30 +has_children: false +--- +# Prepare Your Blob Storage Container + +[Create a container in Azure portal](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal#create-a-container): + +1. From the Azure portal, Storage Accounts, choose your account, then in the container tab click `+ Container`. +1. Make sure you block public access + +## Authenticate with Secret Key +{: .no_toc } + +In case you want to use the secret key for authentication you will need to use the account key in the configuration +Go to the `Access Keys` tab and click on `Show Keys` save the values under `Storage account name` and `Key` we will need them in the [installing lakeFS](install.md) step +## Authenticate with Active Directory +{: .no_toc } + +In case you want your lakeFS Installation (we will install in the next step) to access this Container using Active Directory authentication, +First go to the container you created in step 1. +* Go to `Access Control (IAM)` +* Go to the `Role assignments` tab +* Add the `Storage Blob Data Contributor` role to the Installation running lakeFS. + +You can now proceed to [Installing lakeFS](../deploy/azure.md). 
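For readers who prefer the command line over the portal, the following is a hedged sketch of equivalent Azure CLI steps. The storage account name, resource group and principal ID are placeholders, and the role assignment here is scoped to the whole storage account rather than the single container described above; adapt it to your environment:

```shell
# Retrieve the account key (the value shown under "Access Keys" in the portal).
# "mylakefsstorage" and "my-resource-group" are illustrative placeholders.
az storage account keys list \
  --account-name mylakefsstorage \
  --resource-group my-resource-group \
  --query "[0].value" --output tsv

# Or, for Active Directory authentication, grant the identity running lakeFS
# the Storage Blob Data Contributor role (scoped here to the storage account).
az role assignment create \
  --role "Storage Blob Data Contributor" \
  --assignee "<object-id-of-the-lakefs-identity>" \
  --scope "$(az storage account show \
      --name mylakefsstorage \
      --resource-group my-resource-group \
      --query id --output tsv)"
```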
diff --git a/docs/storage/gcs.md b/docs/storage/gcs.md
new file mode 100644
index 00000000000..875f0541754
--- /dev/null
+++ b/docs/storage/gcs.md
@@ -0,0 +1,16 @@
+---
+layout: default
+title: Google Cloud Storage
+description: This guide explains how to configure Google Cloud Storage as the underlying storage layer.
+parent: Prepare Your Storage
+nav_order: 30
+has_children: false
+---
+
+# Prepare Your GCS Bucket
+
+1. On the Google Cloud Storage console, click *Create Bucket*. Follow the instructions.
+
+1. On the *Permissions* tab, add the service account you intend to use lakeFS with. Give it a role that allows reading and writing to the bucket, e.g. *Storage Object Creator*.
+
+You can now proceed to [Installing lakeFS](../deploy/gcp.md).
diff --git a/docs/storage/index.md b/docs/storage/index.md
new file mode 100644
index 00000000000..5827240027b
--- /dev/null
+++ b/docs/storage/index.md
@@ -0,0 +1,17 @@
+---
+layout: default
+title: Prepare Your Storage
+description: A production installation of lakeFS will usually use your cloud provider's object storage as the underlying storage layer
+nav_order: 8
+has_children: true
+---
+
+# Prepare Your Storage
+
+A production installation of lakeFS will usually use your cloud provider's object storage as the underlying storage layer.
+You can choose to create a new bucket/container (recommended), or use an existing one with a path prefix.
+The path under the existing bucket/container should be empty.
+
+Once you have a bucket/container configured, proceed to [Deploying lakeFS](../deploy/index.md).
+
+Choose your storage provider to configure your storage.
diff --git a/docs/storage/s3.md b/docs/storage/s3.md
new file mode 100644
index 00000000000..df4121e64ca
--- /dev/null
+++ b/docs/storage/s3.md
@@ -0,0 +1,61 @@
+---
+layout: default
+title: AWS S3
+description: This guide explains how to configure AWS S3 as the underlying storage layer.
+parent: Prepare Your Storage
+nav_order: 20
+has_children: false
+redirect_from:
+  - ../deploying-aws/storage.html
+  - ../deploying-aws/bucket.html
+---
+
+# Prepare Your S3 Bucket
+
+1. From the S3 Administration console, choose `Create Bucket`.
+2. Make sure you:
+   1. Block public access
+   2. Disable Object Locking
+3. lakeFS requires permissions to interact with your bucket. The following is a minimal bucket policy. To add it, go to the `Permissions` tab, and paste it as your bucket policy:
+
+   ```json
+   {
+      "Id": "Policy1590051531320",
+      "Version": "2012-10-17",
+      "Statement": [
+         {
+            "Sid": "Stmt1590051522178",
+            "Action": [
+               "s3:GetObject",
+               "s3:GetObjectVersion",
+               "s3:PutObject",
+               "s3:AbortMultipartUpload",
+               "s3:ListMultipartUploadParts",
+               "s3:GetBucketVersioning",
+               "s3:ListBucket",
+               "s3:GetBucketLocation",
+               "s3:ListBucketMultipartUploads",
+               "s3:ListBucketVersions"
+            ],
+            "Effect": "Allow",
+            "Resource": ["arn:aws:s3:::<BUCKET_NAME>", "arn:aws:s3:::<BUCKET_NAME>/*"],
+            "Principal": {
+               "AWS": ["arn:aws:iam::<ACCOUNT_ID>:role/<IAM_ROLE>"]
+            }
+         }
+      ]
+   }
+   ```
+
+   Replace `<BUCKET_NAME>`, `<ACCOUNT_ID>` and `<IAM_ROLE>` with values relevant to your environment.
+   `IAM_ROLE` should be the role assumed by your lakeFS installation.
+
+   Alternatively, if you use an AWS user's key-pair to authenticate lakeFS to AWS, change the policy's Principal to be the user:
+
+   ```json
+   "Principal": {
+      "AWS": ["arn:aws:iam::<ACCOUNT_ID>:user/<IAM_USER>"]
+   }
+   ```
+
+You can now proceed to [Installing lakeFS](../deploy/aws.md).
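To make the bucket-policy step above concrete, here is a minimal sketch of attaching it from the command line instead of the console. It assumes the JSON was saved locally as `policy.json` with the placeholders already substituted; the bucket name below is illustrative:

```shell
# Attach the bucket policy prepared above.
# "my-lakefs-bucket" and the local policy.json path are assumptions.
aws s3api put-bucket-policy \
  --bucket my-lakefs-bucket \
  --policy file://policy.json
```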
diff --git a/docs/architecture/overview.md b/docs/understand/architecture.md similarity index 92% rename from docs/architecture/overview.md rename to docs/understand/architecture.md index 1921fbe8629..2451d06b8db 100644 --- a/docs/architecture/overview.md +++ b/docs/understand/architecture.md @@ -1,10 +1,13 @@ --- layout: default -title: Overview -parent: Architecture +title: Architecture +parent: Understanding lakeFS description: lakeFS architecture overview. Learn more about lakeFS components, including its S3 API gateway. -nav_order: 1 +nav_order: 10 has_children: false +redirect_from: + - architecture/index.html + - architecture/overview.html --- # Architecture Overview {: .no_toc } @@ -44,7 +47,7 @@ The Swagger ([OpenAPI](https://swagger.io/docs/specification/basic-structure/){: The S3 Storage Adapter is the component in charge of communication with the underlying S3 bucket. It is logically decoupled from the S3 Gateway to allow for future compatibility with other types of underlying storage such as HDFS or S3-Compatible storage providers. -See the [roadmap](../roadmap.md) for information on future plans for storage compatibility. +See the [roadmap](roadmap.md) for information on future plans for storage compatibility. ### Metadata Index diff --git a/docs/architecture/data-model.md b/docs/understand/data-model.md similarity index 92% rename from docs/architecture/data-model.md rename to docs/understand/data-model.md index c402c5fe2c8..0d99f3dc854 100644 --- a/docs/architecture/data-model.md +++ b/docs/understand/data-model.md @@ -1,10 +1,11 @@ --- layout: default title: Data Model -parent: Architecture -description: lakeFS Data Model explained -nav_order: 2 +parent: Understanding lakeFS +description: This page explains the lakeFS Data Model +nav_order: 20 has_children: false +redirect_from: ../architecture/data-model.html --- # Data Model {: .no_toc } @@ -99,7 +100,7 @@ Luckily, this is also much smaller data, compared to the committed dataset. References and uncommitted data are currently stored on PostgreSQL for its strong consistency and transactional guarantees. -[In the future](../roadmap.md#lakefs-on-the-rocks-milestone-3---remove-postgresql) we plan on eliminating the need for an RDBMS by embedding [Raft](https://raft.github.io/){: target="_blank" } to replicate these writes across a cluster of machines, with the data itself being stored in RocksDB. To make operations easier, the replicated RocksDB database will be periodically snapshotted to the underlying object store. +[In the future](roadmap.md#lakefs-on-the-rocks-milestone-3---remove-postgresql) we plan on eliminating the need for an RDBMS by embedding [Raft](https://raft.github.io/){: target="_blank" } to replicate these writes across a cluster of machines, with the data itself being stored in RocksDB. To make operations easier, the replicated RocksDB database will be periodically snapshotted to the underlying object store. For extremely large installations ( >= millions of read/write operations per second), it will be possible to utilize [multi-Raft](https://pingcap.com/blog/2017-08-15-multi-raft/){: target="_blank" } to shard references across a wider fleet of machines. diff --git a/docs/understand/index.md b/docs/understand/index.md new file mode 100644 index 00000000000..aadbdab99ab --- /dev/null +++ b/docs/understand/index.md @@ -0,0 +1,9 @@ +--- +layout: default +title: Understanding lakeFS +description: This section includes all the details about the lakeFS open source project. 
+nav_order: 50 +has_children: true +--- + +This section includes all the details about the lakeFS open source project. diff --git a/docs/licensing.md b/docs/understand/licensing.md similarity index 95% rename from docs/licensing.md rename to docs/understand/licensing.md index 7c91f00bb51..5bce175c5bd 100644 --- a/docs/licensing.md +++ b/docs/understand/licensing.md @@ -1,9 +1,11 @@ --- layout: default title: Licensing -description: lakeFS is an open source project under the Apache 2.0 license. As a commercial organization, we intend to use an open core model. -nav_order: 65 +parent: Understanding lakeFS +description: lakeFS is an open source project under the Apache 2.0 license. +nav_order: 50 has_children: false +redirect_from: ../licensing.html --- # Licensing @@ -19,7 +21,7 @@ We believe small organizations should be able to use cutting edge technologies f As a commercial organization, we intend to use an open core model. -![Open Core Model](assets/img/open_core.png) +![Open Core Model](../assets/img/open_core.png) ***What is our commitment to open source?*** diff --git a/docs/roadmap.md b/docs/understand/roadmap.md similarity index 96% rename from docs/roadmap.md rename to docs/understand/roadmap.md index 68b9052f6ef..843292f6928 100644 --- a/docs/roadmap.md +++ b/docs/understand/roadmap.md @@ -1,9 +1,11 @@ --- layout: default title: Roadmap +parent: Understanding lakeFS description: New features and improvements are lined up next for lakeFS. We would love you to be part of building lakeFS’s roadmap. -nav_order: 45 +nav_order: 40 has_children: false +redirect_from: ../roadmap.html --- # Roadmap @@ -93,7 +95,7 @@ A way to ensure certain branches (i.e. main) are only merged to, and are not bei main ensures schema never breaks and all partitions are complete and tested) ### Webhook Support integration: Metastore registration -Using webhooks, we can automatically register or update collections in a Hive/Glue metastore, using [Symlink Generation](https://docs.lakefs.io/using/glue_hive_metastore.html#create-symlink), this will also allow systems that don’t natively integrate with lakeFS to consume data produced using lakeFS. +Using webhooks, we can automatically register or update collections in a Hive/Glue metastore, using [Symlink Generation](../integrations/glue_hive_metastore.md#create-symlink), this will also allow systems that don’t natively integrate with lakeFS to consume data produced using lakeFS. ### Webhook Support integration: Metadata validation Provide a basic wrapper around something like [pyArrow](https://pypi.org/project/pyarrow/) that validates Parquet or ORC files for common schema problems such as backwards incompatibility. diff --git a/docs/architecture/sizing-guide.md b/docs/understand/sizing-guide.md similarity index 97% rename from docs/architecture/sizing-guide.md rename to docs/understand/sizing-guide.md index 5e4832841af..883b4de9276 100644 --- a/docs/architecture/sizing-guide.md +++ b/docs/understand/sizing-guide.md @@ -1,10 +1,12 @@ --- layout: default title: Sizing Guide -parent: Architecture -description: Sizing guide for deploying lakeFS -nav_order: 3 +parent: Understanding lakeFS +description: This page provides a detailed sizing guide for deploying lakeFS +nav_order: 30 has_children: false +redirect_from: + - ../architecture/sizing-guide.html --- # Sizing guide {: .no_toc } @@ -29,7 +31,7 @@ For high throughput, additional CPUs help scale requests across different cores. 
"Expensive" operations such as large diff or commit operations can take advantage of multiple cores. ### Network -If using the data APIs such as the [S3 Gateway](overview.md#s3-gateway), +If using the data APIs such as the [S3 Gateway](architecture.md#s3-gateway), lakeFS will require enough network bandwidth to support the planned concurrent network upload/download operations. For most cloud providers, more powerful machines (i.e. more expensive and usually with more CPU cores) also provide increased network bandwidth. @@ -282,7 +284,7 @@ Here are a few notable metrics to keep track of when sizing lakeFS: `api_request_duration_seconds` - Histogram of latency per operation type -`gateway_request_duration_seconds` - Histogram of latency per [S3 Gateway](overview.md#s3-gateway) operation +`gateway_request_duration_seconds` - Histogram of latency per [S3 Gateway](architecture.md#s3-gateway) operation `go_sql_stats_*` - Important client-side metrics collected from the PostgreSQL driver. See [The full reference here](https://github.com/dlmiddlecote/sqlstats#exposed-metrics){: target="_blank" }. @@ -299,7 +301,7 @@ Data being managed by lakeFS is both structured, tabular data; as well as unstructured sensor and image data used for training. Assuming a team of 20-50 researchers, with a dataset size of 500 TiB across 20M objects. -**Environment:** lakeFS will be deployed on [Kubernetes](../deploying-aws/install.md#kubernetes-with-helm) +**Environment:** lakeFS will be deployed on [Kubernetes](../deploy/k8s.md) managed by [AWS EKS](https://aws.amazon.com/eks/){: target="_blank" } with PostgreSQL on [AWS RDS Aurora](https://aws.amazon.com/rds/aurora/postgresql-features/){: target="_blank" } @@ -328,7 +330,7 @@ Airflow DAGs start by creating a branch for isolation and for CI/CD. Data being managed by lakeFS is structured, tabular data. Total dataset size is 10 PiB, spanning across 500M objects. Expected throughput is 10k reads/second + 2k writes per second across 100 concurrent branches. -**Environment:** lakeFS will be deployed on [Kubernetes](../deploying-aws/install.md#kubernetes-with-helm) +**Environment:** lakeFS will be deployed on [Kubernetes](../deploy/k8s.md) managed by [AWS EKS](https://aws.amazon.com/eks/){: target="_blank" } with PostgreSQL on [AWS RDS](https://aws.amazon.com/rds/aurora/postgresql-features/){: target="_blank" } diff --git a/docs/usecases/cd.md b/docs/usecases/cd.md new file mode 100644 index 00000000000..aa6479fe742 --- /dev/null +++ b/docs/usecases/cd.md @@ -0,0 +1,57 @@ +--- +layout: default +title: Continuous Data Deployment +parent: Example Use-Cases +description: lakeFS helps you continuously validate expectations and assumptions from the data itself. +nav_order: 45 +--- + +## Continuous Deployment +Not every day we introduce new data to the lake, or add/change ETLs, but we do have recurring jobs that are running, and updates to our existing data collections. Even if the code and infra didn't change, the data might, and those changes introduce quality issues. This is one of the complexities of a data product, the data we consume changes over the course of a month, a week, or even a single day. + +**Examples of changes to data that may occur:** + - A client-side bug in the data collection of website events + - A new Android version that interferes with the collecting events from your App + - COVID-19 abrupt impact on consumers' behavior, and its effect on the accuracy of ML models. 
+ - During a change to Salesforce interface, the validation requirement from a certain field had been lost + +lakeFS helps you validate your expectations and assumptions from the data itself. + + +### Example 1: Pre merge hook - a data quality issue + +Continuous deployment of existing data we expect to consume, flowing from our ingest-pipelines into the lake. Similar to the Continuous Integration use-case - we create a ingest branch (“events-data”), which allows us to create tests using data analysis tools or data quality services (e.g. [Great Expectations](https://greatexpectations.io/){: target="_blank" }, [Monte Carlo](https://www.montecarlodata.com/){: target="_blank" }) to ensure reliability of the data we merge to the main branch. Since merge is atomic, no performance issue will be introduced by using lakeFS, but your main branch will only include quality data. + +branching_6 + +### Example 2: RollBack! - Data ingested from a Kafka stream + +If you introduce a new code version to production and discover it has a critical bug, you can simply roll back to the previous version. But you also need to roll back the results of running it. lakeFS gives you the power to rollback your data if you introduced low quality data. The rollback is an atomic action that prevents the data consumers from receiving low quality data until the issue is resolved. + +As previously mentioned, with lakeFS the recommended branching schema is to ingest data to a dedicated branch. When streaming data, we can decide to merge the incoming data to main at a given time interval or checkpoint, depending on how we chose to write it from Kafka. + +You can run quality tests for each merge (as presented in Example 1). Alas, tests are not perfect and we might still introduce low quality data at some point. In such a case, we can rollback main to the last known high quality commit, since our commits for streaming will include the metadata of the Kafka offset. + +branching_7 + +_Rolling back a branch to a previous commit using the CLI_ + + ```shell + lakectl branch reset lakefs://example-repo/stream-1 --commit ~79RU9aUsQ9GLnU + ``` + +**Note** lakeFS version <= v0.33.1 uses '@' (instead of '/') as separator between repository and branch. + +### Example 3: Cross collection consistency + +We often need consistency between different data collections. A few examples may be: + - To join different collections in order to create a unified view of an account, a user or another entity we measure. + - To introduce the same data in different formats + - To introduce the same data with a different leading index or sorting due to performance considerations + +lakeFS will help ensure you introduce only consistent data to your consumers by exposing the new collections and their join in one atomic action to main. Once you consumed the collections on a different branch, and only when both are synchronized, we calculated the join and merged to main. + +In this example you can see two data sets (Sales data and Marketing data) consumed each to its own independent branch, and after the write of both data sets is completed, they are merged to a different branch (leads branch) where the join ETL runs and creates a joined collection by account. The joined table is then merged to main. +The same logic can apply if the data is ingested in streaming, using standard formats, or formats that allow upsert/delete such as Apache Hudi, Delta Lake or Iceberg. 
+ +branching_8 diff --git a/docs/usecases/ci.md b/docs/usecases/ci.md new file mode 100644 index 00000000000..f444d4c8145 --- /dev/null +++ b/docs/usecases/ci.md @@ -0,0 +1,27 @@ +--- +layout: default +title: Continuous Data Integration +parent: Example Use-Cases +description: lakeFS enables to continuously test newly ingested data to ensure data quality requirements are met +nav_order: 35 +--- + +## Continuous Data Integration + +Everyday data lake management includes ingestion of new data collections, and a growing number of consumers reading and writing analysis results to the lake. In order to ensure our lake is reliable we need to validate new data sources, enforce good practices to maintain a clean lake (avoid the swamp) and validate metadata. lakeFS simplifies continuous integration of data to the lake by supporting ingestion on a designated branch. Merging data to main is enabled only if conditions apply. To make this tenable, let’s look at a few examples: + +### Example 1: Pre-merge hooks - enforce best practices + +Examples of good practices enforced in organizations: + + - No user_* columns except under /private/... + - Only `(*.parquet | *.orc | _delta_log/*.json)` files allowed + - Under /production, only backward-compatible schema changes are allowed + - New tables on main must be registered in our metadata repository first, with owner and SLA + +lakeFS will assist in enforcing best practices by giving you a designated branch to ingest new data (“new-data-1” in the drawing). . You may run automated tests to validate predefined best practices as pre-merge hooks. If the validation passes, the new data will be automatically and atomically merged to the main branch. However, if the validation fails, you will be alerted, and the new data will not be exposed to consumers. + +By using this branching model and implementing best practices as pre merge hooks, you ensure the main lake is never compromised. + +branching_4 + diff --git a/docs/usecases/data-devenv.md b/docs/usecases/data-devenv.md new file mode 100644 index 00000000000..371c3d9e4c6 --- /dev/null +++ b/docs/usecases/data-devenv.md @@ -0,0 +1,85 @@ +--- +layout: default +title: Data Development Environment +parent: Example Use-Cases +description: lakeFS enables a safe development environment on your data lake without the need to copy or mock data +nav_order: 25 +--- + + +## Data Development Environment + +As part of our routine work with data we develop new code, improve and upgrade old code, upgrade infrastructures, and test new technologies. lakeFS enables a safe development environment on your data lake without the need to copy or mock data, work on the pipelines or involve DevOps. + +Creating a branch provides you an isolated environment with a snapshot of your repository (any part of your data lake you chose to manage on lakeFS). While working on your own branch in isolation, all other data users will be looking at the repository’s main branch. They can't see your changes, and you don’t see changes to main done after you created the branch. +No worries, no data duplication is done, it’s all metadata management behind the scenes. +Let’s look at 3 examples of a development environment and their branching models. + +### Example 1: Upgrading Spark and using Reset action + +You installed the latest version of Apache Spark. As a first step you’ll test your Spark jobs to see that the upgrade doesn't have any undesired side effects. 
+ +For this purpose, you may create a branch (testing-spark-3.0) which will only be used to test the Spark upgrade, and discarded later. Jobs may run smoothly (the theoretical possibility exists!), or they may fail halfway through, leaving you with some intermediate partitions, data and metadata. In this case, you can simply *reset* the branch to its original state, without worrying about the intermediate results of your last experiment, and perform another (hopefully successful) test in an isolated branch. Reset actions are atomic and immediate, so no manual cleanup is required. + +Once testing is completed, and you have achieved the desired result, you can delete this experimental branch, and all data not used on any other branch will be deleted with it. + +branching_1 + +_Creating a testing branch:_ + + ```shell + lakectl branch create \ + lakefs://example-repo/testing-spark-3 \ + --source lakefs://example-repo/main + # output: + # created branch 'testing-spark-3', pointing to commit ID: '~79RU9aUsQ9GLnU' + ``` + +_Resetting changes to a branch:_ + + ```shell + lakectl branch reset lakefs://example-repo/testing-spark-3 + # are you sure you want to reset all uncommitted changes?: y█ + ``` + +**Note** lakeFS version <= v0.33.1 uses '@' (instead of '/') as separator between repository and branch. + +### Example 2: Compare - Which option is better? + +Easily compare by testing which one performs better on your data set. +Examples may be: +* Different computation tools, e.g Spark vs. Presto +* Different compression algorithms +* Different Spark configurations +* Different code versions of an ETL + +Run each experiment on its own independent branch, while the main remains untouched. Once both experiments are done, create a comparison query (using hive or presto or any other tool of your choice) to compare data characteristics, performance or any other metric you see fit. + +With lakeFS you don't need to worry about creating data paths for the experiments, copying data, and remembering to delete it. It’s substantially easier to avoid errors and maintain a clean lake after. + +branching_2 + +_Reading from and comparing branches using Spark:_ + + ```scala + val dfExperiment1 = sc.read.parquet("s3a://example-repo/experiment-1/events/by-date") + val dfExperiment2 = sc.read.parquet("s3a://example-repo/experiment-2/events/by-date") + + dfExperiment1.groupBy("...").count() + dfExperiment2.groupBy("...").count() // now we can compare the properties of the data itself + ``` + +### Example 3: Reproduce - A bug in production + +You upgraded spark and deployed changes in production. A few days or weeks later, you identify a data quality issue, a performance degradation, or an increase to your infra costs. Something that requires investigation and fixing (aka, a bug). + +lakeFS allows you to open a branch of your lake from the specific merge/commit that introduced the changes to production. Using the metadata saved on the merge/commit you can reproduce all aspects of the environment, then reproduce the issue on the branch and debug it. 
Meanwhile, you can revert the main to a previous point in time, or keep it as is, depending on the use case + +branching_3 + + +_Reading from a historic version (a previous commit) using Spark_ + + ```scala + // represents the data as existed at commit "~79RU9aUsQ9GLnU": + spark.read.parquet("s3://example-repo/~79RU9aUsQ9GLnU/events/by-date") diff --git a/docs/usecases/index.md b/docs/usecases/index.md new file mode 100644 index 00000000000..b3becbc751c --- /dev/null +++ b/docs/usecases/index.md @@ -0,0 +1,9 @@ +--- +layout: default +title: Example Use-Cases +description: Explore example of how other companies are using lakeFS for safe experimentation and CI/CD for data. +nav_order: 25 +has_children: true +redirect_from: + - ../branching/recommendations.html +--- diff --git a/docs/using/index.md b/docs/using/index.md deleted file mode 100644 index e8414fed446..00000000000 --- a/docs/using/index.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -layout: default -title: Using lakeFS with... -description: You can use lakeFS with all modern data frameworks such as Spark, Hive, AWS Athena, Presto, etc. -nav_order: 35 -has_children: true ----