
Improved Self-Managed Enterprise scaling & troubleshooting #38863

Merged · 1 commit · Jun 5, 2024
12 changes: 9 additions & 3 deletions docs/enterprise-setup/implementation-guide.md

@@ -9,12 +9,11 @@ import TabItem from '@theme/TabItem';

[Airbyte Self-Managed Enterprise](./README.md) is in an early access stage for select priority users. Once you [are qualified for a Self-Managed Enterprise license key](https://airbyte.com/company/talk-to-sales), you can deploy Airbyte with the following instructions.

- Airbyte Self-Managed Enterprise must be deployed using Kubernetes. This is to enable Airbyte's best performance and scale. The core components \(api server, scheduler, etc\) run as deployments while the scheduler launches connector-related pods on different nodes.
+ Airbyte Self-Managed Enterprise must be deployed using Kubernetes. This is to enable Airbyte's best performance and scale. The core Airbyte components (`server`, `webapp`, `workload-launcher`) run as deployments. The `workload-launcher` is responsible for managing connector-related pods (`check`, `discover`, `read`, `write`, `orchestrator`).

Contributor:

Would it make sense to expand on the first sentence and hint at more specific requirements listed later? Something like "Airbyte Self-Managed Enterprise must be deployed on a Kubernetes cluster managed with Helm. This is to enable Airbyte's best performance and scale. We support enterprise deployments on AWS, GCP, or Azure using services outlined in this document."


## Prerequisites

### Infrastructure Prerequisites

For a production-ready deployment of Self-Managed Enterprise, various infrastructure components are required. We recommend deploying to Amazon EKS or Google Kubernetes Engine. The following diagram illustrates a typical Airbyte deployment running on AWS:

![AWS Architecture Diagram](./assets/self-managed-enterprise-aws.png)
@@ -23,12 +22,19 @@ Prior to deploying Self-Managed Enterprise, we recommend having each of the following:

| Component | Recommendation |
| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
- | Kubernetes Cluster       | Amazon EKS cluster running in [2 or more availability zones](https://docs.aws.amazon.com/eks/latest/userguide/disaster-recovery-resiliency.html) on a minimum of 6 nodes.   |
+ | Kubernetes Cluster       | Amazon EKS cluster running on EC2 instances in [2 or more availability zones](https://docs.aws.amazon.com/eks/latest/userguide/disaster-recovery-resiliency.html) on a minimum of 6 nodes. |
| Ingress | [Amazon ALB](#configuring-ingress) and a URL for users to access the Airbyte UI or make API requests. |
| Object Storage | [Amazon S3 bucket](#configuring-external-logging) with two directories for log and state storage. |
| Dedicated Database | [Amazon RDS Postgres](#configuring-the-airbyte-database) with at least one read replica. |
| External Secrets Manager | [Amazon Secrets Manager](/operator-guides/configuring-airbyte#secrets) for storing connector secrets. |


A few notes on Kubernetes cluster provisioning for Airbyte Self-Managed Enterprise:
* We support Amazon Elastic Kubernetes Service (EKS) on EC2 or Google Kubernetes Engine (GKE) on Google Compute Engine (GCE). Improved support for Azure Kubernetes Service (AKS) is coming soon.
* We recommend running Airbyte on memory-optimized instances, such as M7i / M7g instance types.
* While we support GKE Autopilot, we do not support Amazon EKS on Fargate.
* We recommend running Airbyte on instances with at least 2 cores and 8 gigabytes of RAM.
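
For illustration, here is a minimal sketch of a cluster matching the recommendations above — assuming EKS provisioned with `eksctl`; the cluster name, region, and instance type are placeholders:

```yaml
# Hypothetical eksctl config: 6 nodes across 2 AZs, on instances above the
# 2-core / 8 GiB floor recommended above.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: airbyte            # placeholder cluster name
  region: us-east-1        # placeholder region
availabilityZones: ["us-east-1a", "us-east-1b"]
nodeGroups:
  - name: airbyte-nodes
    instanceType: m7i.xlarge   # 4 vCPU / 16 GiB
    desiredCapacity: 6
```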

We require you to install and configure the following Kubernetes tooling:

1. Install `helm` by following [these instructions](https://helm.sh/docs/intro/install/)
135 changes: 135 additions & 0 deletions docs/enterprise-setup/scaling-airbyte.md

@@ -0,0 +1,135 @@
---
products: oss-enterprise
---

# Scaling Airbyte After Installation

Once you've completed the initial installation of Airbyte Self-Managed Enterprise, the next crucial step is scaling your setup as needed to ensure optimal performance and reliability as your data integration needs grow. This guide will walk you through best practices and strategies for scaling Airbyte in an enterprise environment.

## Concurrent Syncs

The primary indicator of increased resource usage in Airbyte is the number of concurrent syncs running at any given time. Each concurrent sync requires at least 3 additional connector pods to be running at once (`orchestrator`, `read`, `write`). This means that 10 concurrent syncs require 30 additional pods in your namespace. Connector pods last only for the duration of a sync, and their names are suffixed with the ID of the ongoing job.

If your deployment of Airbyte is intended to run many concurrent syncs at once (e.g. an overnight backfill), we recommend provisioning an increased number of instances. Some connectors are memory and CPU intensive, while others are not. Using an infrastructure monitoring tool, we recommend measuring the following at all times:
* Requested CPU %
* CPU Usage %
* Requested Memory %
* Memory Usage %

If CPU or memory usage is high, we recommend scaling your Airbyte deployment to a higher number of nodes, or reducing the maximum resources any given connector pod may consume. If high _requested_ CPU or memory is blocking new pods from being scheduled while _used_ CPU or memory remains low, you may modify the connector pod provisioning defaults in your `values.yml` file:

```yaml
global:
  edition: "enterprise"
  # ...
  jobs:
    resources:
      limits:
        cpu: 2        # example: cap each connector pod at 2 cores
        memory: 2Gi   # example: cap each connector pod at 2 GiB
      requests:
        cpu: 500m     # example request; tune to observed usage
        memory: 1Gi   # example request; tune to observed usage
```

If your Airbyte deployment is underprovisioned, you may often notice occasional 'stuck jobs' that remain in-progress for long periods, with eventual failures related to unavailable pods. If you begin to see such errors, we recommend you follow the troubleshooting steps above.

Contributor:
Suggested change:
- If your Airbyte deployment is underprovisioned, you may often notice occasional 'stuck jobs' that remain in-progress for long periods, with eventual failures related to unavailable pods. If you begin to see such errors, we recommend you follow the troubleshooting steps above.
+ If your Airbyte deployment is underprovisioned, you may notice occasional 'stuck jobs' that remain in-progress for long periods, with eventual failures related to unavailable pods. If you begin to see such errors, we recommend you follow the troubleshooting steps above.


## Multiple Node Groups

To reduce the blast radius of an underprovisioned Airbyte deployment, we recommend placing 'static' workloads (`webapp`, `server`, etc.) on one Kubernetes node group, while placing job-related workloads (connector pods) on a different Kubernetes node group. This ensures that UI or API availability is unlikely to be impacted by the number of concurrent syncs.
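
For the node selectors below to have any effect, the nodes in each group must carry a matching `type` label. As a sketch — again assuming EKS with `eksctl`; group names, instance types, and sizes are illustrative — the two labeled node groups could be defined like this:

```yaml
# Hypothetical eksctl node groups, labeled to match the nodeSelector values
# used in the Helm values below.
nodeGroups:
  - name: airbyte-static     # core services: webapp, server, temporal, ...
    instanceType: m7i.xlarge
    desiredCapacity: 3
    labels:
      type: static
  - name: airbyte-jobs       # connector pods launched per sync
    instanceType: m7i.xlarge
    desiredCapacity: 3
    labels:
      type: jobs
```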

<details>
<summary>Configure Airbyte Self-Managed Enterprise to run in two node groups</summary>

```yaml
airbyte-bootloader:
  nodeSelector:
    type: static

server:
  nodeSelector:
    type: static

keycloak:
  nodeSelector:
    type: static

keycloak-setup:
  nodeSelector:
    type: static

temporal:
  nodeSelector:
    type: static

webapp:
  nodeSelector:
    type: static

worker:
  nodeSelector:
    type: jobs

workload-launcher:
  nodeSelector:
    type: static
  ## Pods spun up by the workload launcher will run in the 'jobs' node group.
  extraEnvs:
    - name: JOB_KUBE_NODE_SELECTORS
      value: type=jobs
    - name: SPEC_JOB_KUBE_NODE_SELECTORS
      value: type=jobs
    - name: CHECK_JOB_KUBE_NODE_SELECTORS
      value: type=jobs
    - name: DISCOVER_JOB_KUBE_NODE_SELECTORS
      value: type=jobs

orchestrator:
  nodeSelector:
    type: jobs

workload-api-server:
  nodeSelector:
    type: jobs
```

</details>
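
Note that `nodeSelector` only steers pods toward the right group; it does not prevent unrelated workloads from landing on the `jobs` nodes. If you want that group reserved exclusively for connector pods, a common pattern is to taint the node group (via your cluster tooling) and have job pods tolerate the taint. A hedged sketch, assuming Airbyte's `JOB_KUBE_TOLERATIONS` environment variable — verify the exact value format for your version:

```yaml
workload-launcher:
  extraEnvs:
    # Assumed format: k=v pairs separated by commas; multiple tolerations
    # separated by ';'. The corresponding taint must be applied to the
    # 'jobs' node group itself.
    - name: JOB_KUBE_TOLERATIONS
      value: key=airbyte-role,operator=Equal,value=jobs,effect=NoSchedule
```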

## High Availability

You may wish to implement high availability (HA) to minimize downtime and ensure continuous data integration. Note that this requires provisioning Airbyte on a larger number of nodes, which may increase your licensing fees. For a typical HA deployment, you will want a VPC with subnets in at least two (and preferably three) availability zones (AZs).

We particularly recommend having multiple instances of `worker` and `server` pods:

```yaml
worker:
  replicaCount: 2

server:
  replicaCount: 2
```
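
Replicas only improve availability if they are spread across nodes and availability zones. As a sketch — assuming your chart version passes an `affinity` value through to the `server` Deployment, and that `app.kubernetes.io/name: server` matches the pod labels in your release — you could encourage zone spreading with pod anti-affinity:

```yaml
server:
  replicaCount: 2
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: topology.kubernetes.io/zone   # prefer different AZs
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: server         # assumed pod label
```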

Furthermore, you may want to implement a primary-replica setup for the database (e.g., PostgreSQL) used by Airbyte. The primary database handles write operations, while replicas handle read operations, ensuring data availability even if the primary fails.
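
As a sketch of what pointing Airbyte at such an external primary might look like — assuming the chart's `externalDatabase` values (key names vary across chart versions) and a placeholder RDS endpoint:

```yaml
postgresql:
  enabled: false   # disable the bundled Postgres
externalDatabase:
  host: airbyte-prod.cluster-abc123.us-east-1.rds.amazonaws.com  # placeholder
  port: 5432
  database: db-airbyte
  user: airbyte
  existingSecret: airbyte-db-secret   # assumed Kubernetes secret holding the DB password
```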

## Disaster Recovery (DR) Regions

For business-critical applications of Airbyte, you may want to configure a Disaster Recovery (DR) cluster for Airbyte. We do not support assisting customers with DR deployments at this time. However, we offer a few high level suggestions:
1. We strongly recommend configuring an external database, external log storage, and external connector secret management (a sketch of the external storage piece follows this list).
2. We strongly recommend that your DR cluster is also an instance of Self-Managed Enterprise, kept at the same version as your production instance.
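
As a sketch of the external storage piece of item 1 — assuming the chart's `global.storage` values for S3 (key names vary across chart versions; bucket names and region are placeholders):

```yaml
global:
  storage:
    type: "S3"
    bucket:
      log: airbyte-bucket            # placeholder bucket names
      state: airbyte-bucket
      workloadOutput: airbyte-bucket
    s3:
      region: us-east-1              # placeholder region
      authenticationType: credentials   # or instanceProfile
```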

## DEBUG Logs

We recommend turning off `DEBUG` logs for any non-testing use of Self-Managed Airbyte. Failing to do while running at-scale syncs may result in the `server` pod being overloaded, preventing most of the deployment for operating as normal.
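
As a sketch — assuming log verbosity is controlled by the `LOG_LEVEL` environment variable (verify for your version) — the level can be pinned back to `INFO` the same way other environment overrides are set:

```yaml
server:
  extraEnvs:
    - name: LOG_LEVEL   # assumed variable controlling Airbyte log verbosity
      value: INFO
```
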
Contributor:
Suggested change:
- We recommend turning off `DEBUG` logs for any non-testing use of Self-Managed Airbyte. Failing to do while running at-scale syncs may result in the `server` pod being overloaded, preventing most of the deployment for operating as normal.
+ We recommend turning off `DEBUG` logs for any non-testing use of Self-Managed Airbyte. Failing to do so while running at-scale syncs may result in the `server` pod being overloaded. This would prevent most of the deployment from operating as normal.


## Schema Discovery Timeouts

While configuring a database source connector with hundreds to thousands of tables, each with many columns, the one-time `discover` mechanism - by which we discover the topology of your source - may run for a long time and exceed Airbyte's timeout duration. Should this be the case, you may increase Airbyte's timeout limit as follows:

Contributor:
Suggested change:
- While configuring a database source connector with hundreds to thousands of tables, each with many columns, the one-time `discover` mechanism - by which we discover the topology of your source - may run for a long time and exceed Airbyte's timeout duration. Should this be the case, you may increase Airbyte's timeout limit as follows:
+ Airbyte uses a one-time `discover` mechanism to map out the topology of your source. If a database source connector has hundreds or even thousands of tables, each with many columns, `discover` may run for long enough that it exceeds Airbyte's timeout duration. In such a case, you may increase Airbyte's timeout limit as follows:


Contributor:
It could be useful to link to the Airbyte protocol definition for "Discover" as well: understanding-airbyte/airbyte-protocol#discover


```yaml
server:
  extraEnvs:
    - name: HTTP_IDLE_TIMEOUT
      value: 20m
    - name: READ_TIMEOUT
      value: 30m
```
1 change: 1 addition & 0 deletions docusaurus/sidebars.js
@@ -549,6 +549,7 @@ module.exports = {
      items: [
        "enterprise-setup/implementation-guide",
        "enterprise-setup/api-access-config",
+       "enterprise-setup/scaling-airbyte",
        "enterprise-setup/upgrading-from-community",
      ],
    },