
Move the documentation to the website and complete it #166

Merged · 9 commits · Jul 14, 2024
Changes from 7 commits
131 changes: 9 additions & 122 deletions README.md
@@ -1,129 +1,12 @@
# Ansible deployment
# ScyllaDB Migrator

An Ansible playbook is provided in the `ansible` folder. The playbook installs the prerequisites and Spark on the master and workers listed in the `ansible/inventory/hosts` file. The Scylla Migrator is installed on the Spark master node.
1. Update the `ansible/inventory/hosts` file with the master and worker instances.
2. Update `ansible/ansible.cfg` with the location of your private key, if necessary.
3. The `ansible/templates/spark-env-master-sample` and `ansible/templates/spark-env-worker-sample` files contain the environment variables that determine the number of workers, CPUs per worker, and memory allocations, as well as considerations for setting them.
4. Run `ansible-playbook scylla-migrator.yml`.
5. On the Spark master node:
   `cd scylla-migrator`
   `./start-spark.sh`
6. On the Spark worker nodes:
   `./start-slave.sh`
7. Open the Spark web console:
   - Ensure networking is configured to allow access to the Spark master node on ports 8080 and 4040.
   - Visit http://<spark-master-hostname>:8080
8. Review and modify `config.yaml` based on whether you're performing a migration to CQL or Alternator:
   - If you're migrating to the Scylla CQL interface (from Cassandra, Scylla, or another CQL source), make a copy of `config.yaml.example`, review the comments in it, and edit as directed.
   - If you're migrating to Alternator (from DynamoDB or another Scylla Alternator instance), make a copy of `config.dynamodb.yml`, review the comments in it, and edit as directed.
9. As part of the Ansible deployment, sample submit jobs were created. You may edit and use them:
   - For a CQL migration: edit `scylla-migrator/submit-cql-job.sh` and change the line `--conf spark.scylla.config=config.yaml \` to point to whatever you named the configuration file in the previous step.
   - For an Alternator migration: edit `scylla-migrator/submit-alternator-job.sh` and change the line `--conf spark.scylla.config=/home/ubuntu/scylla-migrator/config.dynamodb.yml \` to reference the configuration file you created and modified in the previous step.
10. Ensure the table has been created in the target environment.
11. Submit the migration by running the appropriate job:
- CQL migration: `./submit-cql-job.sh`
- Alternator migration: `./submit-alternator-job.sh`
12. You can monitor progress by observing the Spark web console you opened in step 7. Additionally, after the job has started, you can track progress via http://<spark-master-hostname>:4040.
Note: when no Spark jobs are actively running, the Spark progress page at port 4040 is unavailable; it only renders while a job is in progress.
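Put together, the happy path on the Spark master node boils down to the sketch below. This is only an illustration of steps 5–11 for a CQL migration; the working directory and the copy of `config.yaml.example` are assumptions, not something the playbook enforces.

```shell
cd scylla-migrator
./start-spark.sh                      # step 5: start the Spark master (run ./start-slave.sh on each worker, step 6)
cp config.yaml.example config.yaml    # step 8: start from the sample CQL config and edit it
./submit-cql-job.sh                   # step 11: submit the CQL migration job
```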
The ScyllaDB Migrator is a Spark application that migrates data to ScyllaDB from CQL-compatible or DynamoDB-compatible databases.

# Configuring the Migrator
## Documentation

Create a `config.yaml` for your migration using the template `config.yaml.example` in the repository root. Read the comments throughout carefully.
See https://migrator.docs.scylladb.com.

# Running on a live Spark cluster

The Scylla Migrator is built against Spark 3.5.1, so you'll need to run that version on your cluster.

Download the latest [release](https://github.com/scylladb/scylla-migrator/releases) of the migrator:

~~~ sh
wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar
~~~

Alternatively, you can [build](#building) a custom version of the migrator.

Copy the jar `scylla-migrator-assembly.jar` and the `config.yaml` you've created to the Spark master server.
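One hypothetical way to copy them over (the user, host, and destination path are placeholders to adapt to your environment):

```shell
scp scylla-migrator-assembly.jar config.yaml ubuntu@<spark-master-hostname>:/home/ubuntu/scylla-migrator/
```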

Start the Spark master and workers. On the master instance:
`cd scylla-migrator`
`./start-spark.sh`

On worker instances:
`./start-slave.sh`

Configure and confirm networking between:
- the source and the Spark servers
- the target and the Spark servers
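
A quick way to confirm this connectivity from the Spark nodes, assuming netcat is available (9042 and 8000 are the default CQL and Alternator ports; adjust to your setup):

```shell
nc -zv <source-hostname> 9042          # CQL source
nc -zv <target-scylla-hostname> 9042   # CQL target
nc -zv <target-scylla-hostname> 8000   # Alternator target, if applicable
```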

Create the schema on the target server.
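For a CQL target this can be done with `cqlsh`; the keyspace, table, and replication settings below are purely illustrative:

```shell
cqlsh <target-scylla-hostname> -e "
  CREATE KEYSPACE IF NOT EXISTS demo WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3};
  CREATE TABLE IF NOT EXISTS demo.users (id uuid PRIMARY KEY, name text);"
```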

Then, run this command on the Spark master server:
```shell
spark-submit --class com.scylladb.migrator.Migrator \
--master spark://<spark-master-hostname>:7077 \
--conf spark.scylla.config=<path to config.yaml> \
<path to scylla-migrator-assembly.jar>
```

If you need to pass a truststore file or other SSL-related files, use the `--files` option:
```shell
spark-submit --class com.scylladb.migrator.Migrator \
--master spark://<spark-master-hostname>:7077 \
--conf spark.scylla.config=<path to config.yaml> \
--files truststorefilename \
<path to scylla-migrator-assembly.jar>
```
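If you need to ship more than one file (for example a truststore and a keystore; the file names below are made up), `--files` accepts a comma-separated list:

```shell
spark-submit --class com.scylladb.migrator.Migrator \
--master spark://<spark-master-hostname>:7077 \
--conf spark.scylla.config=<path to config.yaml> \
--files truststore.jks,keystore.jks \
<path to scylla-migrator-assembly.jar>
```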

# Running the validator

This project also includes an entrypoint for comparing the source
table and the target table. You can launch it like so (after performing
the previous steps):

```shell
spark-submit --class com.scylladb.migrator.Validator \
--master spark://<spark-master-hostname>:7077 \
--conf spark.scylla.config=<path to config.yaml> \
<path to scylla-migrator-assembly.jar>
```

# Running locally

To run in the local Docker-based setup:

1. First start the environment:
```shell
docker compose up -d
```

2. Launch `cqlsh` in Cassandra's container and create a keyspace and a table with some data (a made-up example follows this list):
```shell
docker compose exec cassandra cqlsh
<create stuff>
```

3. Launch `cqlsh` in Scylla's container and create the destination keyspace and table with the same schema as the source table (see the example after this list):
```shell
docker compose exec scylla cqlsh
<create stuff>
```

4. Edit the `config.yaml` file; note the comments throughout.

5. Run `build.sh`.

6. Then, launch `spark-submit` in the master's container to run the job:
```shell
docker compose exec spark-master /spark/bin/spark-submit --class com.scylladb.migrator.Migrator \
--master spark://spark-master:7077 \
--conf spark.driver.host=spark-master \
--conf spark.scylla.config=/app/config.yaml \
/jars/scylla-migrator-assembly.jar
```
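
As a concrete, made-up illustration of steps 2 and 3 above, the source keyspace and table (with one row) and the matching destination schema could also be created non-interactively:

```shell
# step 2: toy source schema and data in Cassandra
docker compose exec cassandra cqlsh -e "
  CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
  CREATE TABLE test.items (id int PRIMARY KEY, name text);
  INSERT INTO test.items (id, name) VALUES (1, 'example');"

# step 3: matching destination schema in Scylla
docker compose exec scylla cqlsh -e "
  CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
  CREATE TABLE test.items (id int PRIMARY KEY, name text);"
```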

The `spark-master` container mounts the `./migrator/target/scala-2.13` dir on `/jars` and the repository root on `/app`. To update the jar with new code, just run `build.sh` and then run `spark-submit` again.
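In practice, the edit/rebuild loop for the local setup is just the following (same `spark-submit` invocation as in step 6):

```shell
./build.sh   # rebuilds migrator/target/scala-2.13/scylla-migrator-assembly.jar, mounted at /jars
docker compose exec spark-master /spark/bin/spark-submit --class com.scylladb.migrator.Migrator \
--master spark://spark-master:7077 \
--conf spark.driver.host=spark-master \
--conf spark.scylla.config=/app/config.yaml \
/jars/scylla-migrator-assembly.jar
```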

# Building
## Building

To test a custom version of the migrator that has not been [released](https://github.com/scylladb/scylla-migrator/releases), you can build it yourself by cloning this Git repository and following the steps below:

@@ -132,3 +15,7 @@
JDK installation.
3. Run `build.sh`.
4. This will produce the .jar file to use in the `spark-submit` command at path `migrator/target/scala-2.13/scylla-migrator-assembly.jar`.
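Put together, a from-source build might look like the sketch below (the clone URL is the project's public GitHub repository; the elided steps above cover the JDK prerequisite):

```shell
git clone https://github.com/scylladb/scylla-migrator.git
cd scylla-migrator
./build.sh
# the assembly to pass to spark-submit:
ls migrator/target/scala-2.13/scylla-migrator-assembly.jar
```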

## Contributing

Please refer to the file [CONTRIBUTING.md](/CONTRIBUTING.md).
2 changes: 1 addition & 1 deletion ansible/templates/spark-env-master-sample
@@ -8,7 +8,7 @@
# MEMORY is used in the spark-submit job and allocates the memory per executor.
# You can have one or more executors per worker.
#
# By using multiple workers on an instance, we can control the velocit of the migration.
# By using multiple workers on an instance, we can control the velocity of the migration.
#
# Eg.
# Target system is 3 x i4i.4xlarge (16 vCPU, 128G)
2 changes: 1 addition & 1 deletion ansible/templates/spark-env-worker-sample
@@ -8,7 +8,7 @@
# MEMORY is used in the spark-submit job and allocates the memory per executor.
# You can have one or more executors per worker.
#
# By using multiple workers on an instance, we can control the velocit of the migration.
# By using multiple workers on an instance, we can control the velocity of the migration.
#
# Eg.
# Target system is 3 x i4i.4xlarge (16 vCPU, 128G)
3 changes: 1 addition & 2 deletions config.yaml.example
@@ -268,8 +268,7 @@ renames: []
# create a savepoint file with this filled.
skipTokenRanges: []

# Configuration section for running the validator. The validator is run manually (see README)
# and currently only supports comparing a Cassandra source to a Scylla target.
# Configuration section for running the validator. The validator is run manually (see README).
validation:
# Should WRITETIMEs and TTLs be compared?
compareTimestamps: true
25 changes: 3 additions & 22 deletions docker-compose.yaml
@@ -1,23 +1,4 @@
version: '3'

services:
scylla:
image: scylladb/scylla:latest
networks:
- scylla
volumes:
- ./data/scylla:/var/lib/scylla
ports:
- "8000:8000"
command: "--smp 2 --memory 2048M --alternator-port 8000 --alternator-write-isolation always_use_lwt"

cassandra:
image: cassandra:latest
networks:
- scylla
volumes:
- ./data/cassandra:/var/lib/cassandra

spark-master:
build: dockerfiles/spark
command: master
@@ -26,7 +7,7 @@ services:
environment:
SPARK_PUBLIC_DNS: spark-master
networks:
- scylla
- spark
expose:
- 7001
- 7002
@@ -58,7 +39,7 @@ services:
SPARK_WORKER_WEBUI_PORT: 8081
SPARK_PUBLIC_DNS: spark-worker
networks:
- scylla
- spark
expose:
- 7012
- 7013
@@ -75,4 +56,4 @@
- spark-master

networks:
scylla:
spark:
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -109,7 +109,7 @@
"hide_feedback_buttons": "false",
"github_issues_repository": "scylladb/scylla-migrator",
"github_repository": "scylladb/scylla-migrator",
"site_description": "Migrate data extract using Spark to Scylla, normally from Cassandra.",
"site_description": "Migrate data using Spark from Cassandra or DynamoDB to Scylla.",
"hide_version_dropdown": [],
"zendesk_tag": "",
"versions_unstable": UNSTABLE_VERSIONS,