diff --git a/deployment/arroyo-cluster.mdx b/deployment/ec2.mdx similarity index 60% rename from deployment/arroyo-cluster.mdx rename to deployment/ec2.mdx index 254a2e5..994af33 100644 --- a/deployment/arroyo-cluster.mdx +++ b/deployment/ec2.mdx @@ -1,56 +1,47 @@ --- -title: Arroyo Cluster -description: "Running a distributed Arroyo cluster" +title: Deploying to EC2 +description: "Setting up an Arroyo cluster on EC2" --- -While the single-node Arroyo cluster is useful for testing and development, it is not suitable for production. This -page describes how to run a production-ready distributed Arroyo cluster using either Arroyo's built-in scheduler or -[nomad](https://www.nomadproject.io/). +This document will cover how to run an Arroyo cluster on raw EC2 instances. This requires a good understanding of the +Arroyo architecture. For an easier approach to running a production-quality Arroyo cluster, see the docs for running +on top of [nomad](/deployment/nomad). Kubernetes support is also coming soon. -Before attempting to run a cluster, you should familiarize yourself with the [Arroyo architecture](/architecture). We -are also happy to support users rolling out their own clusters, so please reach out to us at support@arroyo.systems with -any questions. +Before starting this guide, follow the common setup steps in the [deployment overview](/deployment/overview) guide. -You will also need to set up a dev environment, as we do not yet distribute binaries. See the [dev setup](/dev-setup) -instructions. +We don't currently distribute binaries for Arroyo, so you will need to build the binaries yourself. Follow the +[dev setup](/developing/dev-setup) guide to learn how. -## Common Setup +## Running the migrations -### Postgres - -Arroyo relies on a postgres database to store configuration data and metadata. You will need to create a database -(by default called `arroyo`, but this can be configured) and run the migrations to set it up. 
+As covered in the dev setup, you will need to run the database migrations on your prod database before starting the
+services. We use [refinery](https://github.com/rust-db/refinery) to manage migrations. To run the migrations on your
database, run these commands from your checkout of arroyo:

```bash
$ cargo install refinery_cli
-$ refinery setup # follow the directions
+$ refinery setup # follow the directions, configuring for your prod database
$ refinery migrate -p arroyo-api/migrations
```

-### S3
-
-You will need to create a S3 bucket (or an equivalent service that exposes an S3-compatible API) to store checkpoints.
-This will need to be writable by the nodes that are running the Arroyo controller and workers.
-
## Running the services

-There are two options for running aa distributed cluster. You can either use Arroyo's built-in scheduler and nodes, or
-you can use [nomad](https://www.nomadproject.io/). Nomad is currently the recommended option, for production usecases.
+There are two options for running a distributed cluster. You can either use Arroyo's built-in scheduler and nodes, or
+you can use [nomad](https://www.nomadproject.io/). Nomad is currently the recommended option for production use cases.
The Arroyo services can be run via Nomad, or separately on VMs.

### Arroyo Services

-An Arroyo cluster consists of more or more arroyo-api process and a single arroyo-controller process. This can be run
-however you would like, and may be run on a single machine or on multiple machines.
+An Arroyo cluster consists of one or more arroyo-api processes and a single arroyo-controller process. This can be run
+however you like, and may be run on a single machine or on multiple machines. To achieve high availability at the API
+layer, you will need to run multiple instances behind a load balancer (such as an ALB).

The arroyo-api server exposes a gRPC API on port 8001 by default, and serves static HTML and JS for the web UI on port
-8000. These can be put behind a load balancer (such as an ALB) for high availability. If the API and controller are not
-running on the same machine, the API needs to be configured with the endpoint of the controller's gRPC API via the
-`CONTROLLER_ADDR` environment variable. By default, the controller runs its gRPC API on port 9190. If the controller's
-hostname is `arroyo-controller.int` then the API would be configured with
+8000. If the API and controller are not running on the same machine, the API needs to be configured with the endpoint of
+the controller's gRPC API via the `CONTROLLER_ADDR` environment variable. By default, the controller runs its gRPC API
+on port 9190. If the controller's hostname is `arroyo-controller.int` then the API would be configured with
`CONTROLLER_ADDR=http://arroyo-controller.int:9190`.

Both arroyo-api and arroyo-controller additionally need to be configured with the database connection information via
diff --git a/deployment/kubernetes.mdx b/deployment/kubernetes.mdx
new file mode 100644
index 0000000..ee1d04d
--- /dev/null
+++ b/deployment/kubernetes.mdx
@@ -0,0 +1,6 @@
+---
+title: Deploying to Kubernetes
+description: "Running an Arroyo cluster on Kubernetes"
+---
+
+Coming soon.
diff --git a/deployment/nomad.mdx b/deployment/nomad.mdx
new file mode 100644
index 0000000..8239dc6
--- /dev/null
+++ b/deployment/nomad.mdx
@@ -0,0 +1,93 @@
+---
+title: Deploying to Nomad
+description: "Running an Arroyo cluster on Nomad"
+---
+
+Arroyo supports Nomad as both a _scheduler_ (for running Arroyo pipeline tasks) and as a deploy target for the Arroyo
+control plane. This is currently the easiest way to get a production-quality Arroyo cluster running.
+
+Before starting this guide, follow the common setup steps in the [deployment overview](/deployment/overview) guide.
+
+This guide assumes a working Nomad cluster. It has been tested with Nomad >= 1.4, but should work with 1.3 as well. See
+the [Nomad documentation](https://www.nomadproject.io/docs) for more information.
+
+Note that all of the components of Arroyo (controller, compiler, and workers) need to be able to access S3. You will
+need to ensure that the Nomad cluster has access to the S3 bucket you will be using.
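+
+One quick sanity check (a sketch, assuming the AWS CLI is installed on a Nomad client node; the bucket name here is
+hypothetical) is to confirm the bucket is reachable from that node:
+
+```bash
+# Substitute your own bucket name for arroyo-prod
+$ aws s3api head-bucket --bucket arroyo-prod && echo "S3 access OK"
+```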
+
+## Install nomad pack
+
+For ease of installation, we distribute a nomad pack that can be used to install Arroyo on Nomad. To use the pack, you
+will first need to install nomad-pack. Follow the documentation
+[here](https://developer.hashicorp.com/nomad/tutorials/nomad-pack/nomad-pack-intro).
+
+Once `nomad-pack` is available on your machine, you are ready to proceed.
+
+## Add the Arroyo registry
+
+The Arroyo pack is available in the [Arroyo registry](https://github.com/ArroyoSystems/arroyo-nomad-pack).
+
+To add the registry, run the following command:
+
+```bash
+$ nomad-pack registry add arroyo \
+    https://github.com/ArroyoSystems/arroyo-nomad-pack.git
+```
+
+## Configuring the pack
+
+There are a number of variables that can be configured to customize the Arroyo deployment:
+
+| Variable | Description |
+| --- | --- |
+| `job_name` | The name of the Nomad job for the Arroyo cluster |
+| `region` | The region where jobs will be deployed |
+| `datacenters` | A list of datacenters in the region which are eligible for task placement |
+| `prometheus_endpoint` | Endpoint for prometheus with protocol, required for job metrics (for example `http://prometheus.service:9090`) |
+| `prometheus_auth` | Basic authentication for prometheus if required |
+| `postgres_host` | Host of your postgres database |
+| `postgres_port` | Port of your postgres database |
+| `postgres_db` | Name of your postgres database |
+| `postgres_user` | User of your postgres database |
+| `postgres_password` | Password of your postgres database |
+| `s3_bucket` | S3 bucket to store checkpoints and pipeline artifacts |
+| `s3_region` | Region for the S3 bucket |
+| `nomad_api` | Nomad API endpoint with protocol (for example `http://nomad.service:4646`) |
+| `compiler_resources` | Controls the CPU and memory to use for the compiler; at least 2 GB of memory is required |
+| `controller_resources` | The resources for the controller and API |
+
+Of these, at least the postgres configuration and the S3 bucket configuration are required.
+
+## Deploying the Arroyo pack
+
+Now we're ready to actually deploy our Arroyo cluster! Here's an example command line:
+
+```bash
+$ nomad-pack run arroyo --registry=arroyo \
+    --var arroyo.postgres_db=arroyo \
+    --var arroyo.postgres_host=postgres-host.cluster \
+    --var arroyo.postgres_user=arroyodb \
+    --var arroyo.postgres_password=arroyodb \
+    --var arroyo.datacenters='["us-east-1"]' \
+    --var arroyo.s3_bucket=arroyo-prod \
+    --var arroyo.prometheus_endpoint="http://prometheus.cluster:9090"
+```
+
+You will need to adjust the variables as appropriate for your environment.
+
+## Accessing the Arroyo API
+
+Once the pack has been deployed, you can access the Arroyo UI by visiting the address of the `api-http` service. By
+default, this has a dynamic port.
+
+To find the endpoint and port, run the following command:
+
+```bash
+$ nomad service info api-http
+```
+
+Visit the address in your browser to access the Arroyo UI.
+
+## Having trouble?
+
+Reach out to us at support@arroyo.systems or on our [Discord](https://discord.gg/cjCr5rVmyR) if you have any questions
+or issues.
diff --git a/deployment/overview.mdx b/deployment/overview.mdx
new file mode 100644
index 0000000..d384475
--- /dev/null
+++ b/deployment/overview.mdx
@@ -0,0 +1,29 @@
+---
+title: Overview
+description: "Running a distributed Arroyo cluster"
+---
+
+While the single-node Arroyo cluster is useful for testing and development, it is not suitable for production. This
+page describes how to run a production-ready distributed Arroyo cluster using either Arroyo's built-in scheduler or
+[nomad](https://www.nomadproject.io/).
+
+Before attempting to run a cluster, you should familiarize yourself with the [Arroyo architecture](/architecture). We
+are also happy to support users rolling out their own clusters, so please reach out to us at support@arroyo.systems or
+on Discord with any questions.
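+
+As a rough sketch, the common setup below amounts to provisioning a metadata database and an object store. Assuming
+the Postgres client tools and the AWS CLI are installed (the bucket name here is hypothetical), that might look like:
+
+```bash
+# Create the metadata database (named `arroyo` by default)
+$ createdb arroyo
+# Create a bucket for checkpoints and artifacts; substitute your own name
+$ aws s3 mb s3://arroyo-checkpoints
+```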
+
+## Common Setup
+
+### Postgres
+
+Arroyo relies on a postgres database to store configuration data and metadata. You will need to create a database
+(by default called `arroyo`, but this can be configured).
+
+### S3
+
+You will need to create an S3 bucket to store checkpoints and artifacts. This will need to be writable by the nodes
+that are running the Arroyo controller and workers.
+
+### Prometheus
+
+The Arroyo Web UI can show job metrics to help monitor job progress. To enable this, you will need to set up a
+Prometheus server. See the [prometheus documentation](https://prometheus.io/docs/introduction/overview/) for more
+details.
diff --git a/mint.json b/mint.json
index fd46acb..4e01915 100644
--- a/mint.json
+++ b/mint.json
@@ -51,7 +51,10 @@
     {
       "group": "Deployment",
       "pages": [
-        "deployment/arroyo-cluster"
+        "deployment/overview",
+        "deployment/ec2",
+        "deployment/nomad",
+        "deployment/kubernetes"
       ]
     },
     {