Add the cdc readme and usage doc
Signed-off-by: SimFG <bang.fu@zilliz.com>
SimFG committed Oct 30, 2023
1 parent 9a36991 commit def2ab3
Showing 6 changed files with 333 additions and 13 deletions.
3 changes: 3 additions & 0 deletions Makefile
@@ -1,5 +1,8 @@
PWD := $(shell pwd)

build:
	$(MAKE) -C server build

test-go:
	@echo "Running go unittests..."
	@(env bash $(PWD)/scripts/run_go_unittest.sh)
40 changes: 39 additions & 1 deletion README.md
@@ -1,3 +1,41 @@
# Milvus-CDC

Milvus-CDC is a change data capture tool for Milvus. It can capture the changes of upstream Milvus collections and sink them to downstream Milvus.
CDC stands for "Change Data Capture". Milvus-CDC is a change data capture tool for Milvus: it captures changes to upstream Milvus collections and sinks them to a downstream Milvus. This brings the following benefits:

1. Data reliability improves and the probability of data loss drops.
2. Active-standby disaster recovery for Milvus can be built on top of CDC, so that if the milvus-source cluster fails, traffic can be switched over quickly and upper-layer services stay available.

## Quick Start

**For now, please build milvus-cdc from source.** The project is iterating rapidly, so the currently published image is not stable, and the implementation has changed considerably.

```bash
git clone https://github.com/zilliztech/milvus-cdc.git
cd milvus-cdc

make build
```

After a successful build, the `cdc` binary is generated in the `server` directory.

**Don't execute it directly.** It will fail with an error, because cdc must be configured before use. For how to configure and use cdc, see [Milvus-CDC Usage](doc/cdc-usage.md).

## Basic Components

At present, cdc consists of two main parts: the http server and corelib.

- The http server accepts user requests, controls task execution, and maintains meta-information.
- corelib executes the synchronization tasks. It includes a reader and a writer:
  - The reader reads the relevant information from the etcd and mq of the source Milvus.
  - The writer converts the messages in mq into Milvus API parameters and sends the requests to the target Milvus.

![components](doc/pic/milvus-cdc-components.png)

## CDC Data Processing Flow

1. The user creates a cdc task through the http interface.
2. cdc obtains collection-related meta-information from the etcd of milvus-source, such as the channel information and checkpoint information for the collection.
3. With that meta-information, cdc connects to the mq (message queue) and subscribes to the data.
4. cdc reads the data from the mq, parses it, and either forwards it through the go-sdk or performs the same operation as milvus-source.

![flow](doc/pic/milvus-cdc-data.png)
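
The reader and writer roles in the flow above can be pictured with a toy sketch. This is not milvus-cdc's actual code: the function name `run_pipeline`, the `buffer_len` parameter, and the string messages are all hypothetical, and a plain Python list stands in for the mq. It only illustrates a reader handing messages to a writer through a bounded buffer, so that a slow writer makes the reader wait instead of dropping data.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the message stream


def run_pipeline(messages, send, buffer_len=10):
    """Pass messages from a reader thread to a writer through a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_len)

    def reader():
        # Stand-in for "read the data in mq": put() blocks while the buffer is full.
        for msg in messages:
            buffer.put(msg)
        buffer.put(_DONE)

    thread = threading.Thread(target=reader)
    thread.start()

    # Stand-in for the writer: forward each message to the target in order.
    while True:
        msg = buffer.get()
        if msg is _DONE:
            break
        send(msg)
    thread.join()


received = []
run_pipeline(["insert-1", "delete-2", "insert-3"], received.append, buffer_len=2)
print(received)
```

Messages reach `send` in the order they were read, which is the property a replication pipeline of this shape relies on.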

291 changes: 291 additions & 0 deletions doc/cdc-usage.md
@@ -0,0 +1,291 @@
# Milvus-CDC Usage

## Limitation

1. Only cluster-level synchronization is supported; that is, the collection name can only be specified as `*` when creating a task. Collection-level synchronization will be supported in the future.
2. Supported operations:
   - Create/Drop Collection
   - Insert/Delete
   - Create/Drop Partition
   - Create/Drop Index
   - Load/Release/Flush
   - Create/Drop Database

   **Anything not mentioned is not supported.**

3. Milvus-CDC only supports data synchronization. If you need active-standby disaster recovery, please contact us.

## Configuration

The configuration consists of two main parts: the Milvus configuration for Milvus-Target and the CDC startup configuration.

Furthermore, please ensure that both the source and target Milvus versions are **2.3.2 or higher**. Any other versions are not supported.

1. Milvus-Target Config

Set the value of `common.ttMsgEnabled` in the milvus.yaml file in the milvus-target cluster to `false`.

2. CDC Startup Config

The following is the cdc startup configuration file template. You can find this file in the `server/configs` directory.

```yaml
# cdc server address
address: 0.0.0.0:8444
# max task num
maxTaskNum: 100
# max task name length
maxNameLength: 256

# cdc meta data config
metaStoreConfig:
  # the metastore type, available value: etcd, mysql
  storeType: etcd
  # etcd address
  etcdEndpoints:
    - localhost:2379
  # mysql connection address
  mysqlSourceUrl: root:root@tcp(127.0.0.1:3306)/milvus-cdc?charset=utf8
  # meta data prefix, if multiple cdc services use the same store service, you can set different rootPaths to achieve multi-tenancy
  rootPath: cdc

# milvus-source config, these settings are basically the same as the corresponding configuration of milvus.yaml in milvus source.
sourceConfig:
  # etcd config
  etcdAddress:
    - localhost:2379
  etcdRootPath: by-dev
  etcdMetaSubPath: meta
  # default partition name
  defaultPartitionName: _default
  # read buffer length, mainly used for buffering if writing data to milvus-target is slow.
  readChanLen: 10
  # milvus-source mq config, which is pulsar or kafka
  pulsar:
    address: pulsar://localhost:6650
    webAddress: localhost:80
    maxMessageSize: 5242880
    tenant: public
    namespace: default
#    authPlugin: org.apache.pulsar.client.impl.auth.AuthenticationToken
#    authParams: token:xxx
#  kafka:
#    address: 127.0.0.1:9092
```

After completing these two configuration steps, do not rush to start CDC. First check the following:

1. The `common.ttMsgEnabled` configuration value of milvus-target is `false`.
2. The `mq` type configured for cdc is the same as the type configured for milvus-source.
3. The network environment of the cdc service can reach the mq and etcd addresses of milvus-source given in the configuration.
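
The network check in the last point can be done by hand, or with a short script. The sketch below is only an illustration and assumes the default addresses from the template above (`localhost:2379` for etcd and `pulsar://localhost:6650` for the mq); the helper names `split_host_port` and `is_reachable` are made up for this example. It only verifies that a TCP connection can be opened, not that the service behind it is healthy.

```python
import socket
from urllib.parse import urlsplit


def split_host_port(address):
    """Split 'host:port' or 'scheme://host:port' into (host, port)."""
    parts = urlsplit(address if "://" in address else "//" + address)
    return parts.hostname, parts.port


def is_reachable(address, timeout=3.0):
    """Return True if a TCP connection to the address can be opened."""
    host, port = split_host_port(address)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Addresses copied from the configuration template; replace with your own.
    for addr in ["localhost:2379", "pulsar://localhost:6650"]:
        print(addr, "reachable" if is_reachable(addr) else "NOT reachable")
```

Run it on the machine that will host the cdc service, since that is the network environment that matters.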

## Usage

All http APIs need to comply with the following rules:

- Request method: POST
- Request path: `/cdc`
- Request body:

```json
{
"request_type": "",
"request_data": {}
}
```

Different requests are distinguished by `request_type`, which currently includes: create, delete, pause, resume, get and list. If the request fails, a non-200 http status code will be returned.
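
As a rough sketch of these rules, the following Python helper wraps any request in the common envelope and posts it to `/cdc`. It uses only the standard library; the function names `build_body` and `call_cdc` are hypothetical, and the base URL assumes the default `address` from the configuration template.

```python
import json
import urllib.request

# The request types listed above.
REQUEST_TYPES = {"create", "delete", "pause", "resume", "get", "list"}


def build_body(request_type, request_data=None):
    """Serialize the common request envelope to JSON bytes."""
    if request_type not in REQUEST_TYPES:
        raise ValueError(f"unknown request_type: {request_type!r}")
    body = {"request_type": request_type}
    if request_data is not None:
        body["request_data"] = request_data
    return json.dumps(body).encode("utf-8")


def call_cdc(request_type, request_data=None, base_url="http://localhost:8444"):
    """POST a request to the cdc server and return the decoded JSON response."""
    req = urllib.request.Request(
        base_url + "/cdc",
        data=build_body(request_type, request_data),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read() or b"{}")
```

With a cdc server running, `call_cdc("list")` would send the list request described later in this document, and a failed request would surface as an `HTTPError`.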

### create request

- milvus_connect_param, the connection params of the milvus-target server;
- collection_infos, the collection information that needs to be synchronized, which currently only supports `*`;
- rpc_channel_info, whose `name` value is composed of the values of `common.chanNamePrefix.cluster` and `common.chanNamePrefix.replicateMsg` in **milvus-source**, joined by the symbol `-`.

```http
POST localhost:8444/cdc
Content-Type: application/json
body:
{
"request_type":"create",
"request_data":{
"milvus_connect_param":{
"host":"localhost",
"port":19530,
"username":"root",
"password":"Milvus",
"enable_tls":true,
"connect_timeout":10
},
"collection_infos":[
{
"name":"*"
}
],
"rpc_channel_info": {
"name": "by-dev-replicate-msg"
}
}
}
```

On success, the task_id is returned, for example:

```json
{"task_id":"6623ae52d35842a5a2c9d89b16ed7aa1"}
```

If an error occurs, an http error status is returned.
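
The `request_data` of a create request can also be assembled programmatically. The sketch below is an assumption-laden illustration: the helper names are invented, and the prefix values `by-dev` and `replicate-msg` are inferred from the example channel name `by-dev-replicate-msg`, so check them against the milvus.yaml of your own milvus-source.

```python
import json


def rpc_channel_name(cluster_prefix, replicate_msg_prefix):
    """Join the two milvus-source channel prefixes with '-'."""
    return f"{cluster_prefix}-{replicate_msg_prefix}"


def create_request_data(host, port, channel_name, **connect_options):
    """Build the request_data object for a create request."""
    connect_param = {"host": host, "port": port}
    # Optional fields such as username, password, enable_tls, connect_timeout.
    connect_param.update(connect_options)
    return {
        "milvus_connect_param": connect_param,
        "collection_infos": [{"name": "*"}],  # only "*" is supported at present
        "rpc_channel_info": {"name": channel_name},
    }


# Prefix values assumed from the example channel name in this document.
name = rpc_channel_name("by-dev", "replicate-msg")
data = create_request_data("localhost", 19530, name, connect_timeout=10)
print(json.dumps({"request_type": "create", "request_data": data}, indent=2))
```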

### delete request

Delete a cdc task.

**request**

```http
POST localhost:8444/cdc
Content-Type: application/json
body:
{
"request_type":"delete",
"request_data": {
"task_id": "f84605ae48fb4170990ab80addcbd71e"
}
}
```

**response**

```json
{}
```

### pause request

Pause a cdc task.

**request**

```http
POST localhost:8444/cdc
Content-Type: application/json
body:
{
"request_type":"pause",
"request_data": {
"task_id": "4d458a58b0f74e85b842b1244dc69546"
}
}
```

**response**

```json
{}
```

### resume request

Resume a cdc task.

**request**

```http
POST localhost:8444/cdc
Content-Type: application/json
body:
{
"request_type":"resume",
"request_data": {
"task_id": "4d458a58b0f74e85b842b1244dc69546"
}
}
```

**response**

```json
{}
```

### get request

Get the info of a cdc task.

**request**

```http
POST localhost:8444/cdc
Content-Type: application/json
body:
{
"request_type":"get",
"request_data": {
"task_id": "4d458a58b0f74e85b842b1244dc69546"
}
}
```

**response**

```json
{
"task_id":"4d458a58b0f74e85b842b1244dc69546",
"Milvus_connect_param":{
"host":"localhost",
"port":19530,
"connect_timeout":10
},
"collection_infos":[
{
"name":"*"
}
],
"state":"Running"
}
```

### list request

List the info of all cdc tasks.

**request**

```http
POST localhost:8444/cdc
Content-Type: application/json
body:
{
"request_type":"list"
}
```

**response**

```json
{
"tasks": [
{
"task_id": "4d458a58b0f74e85b842b1244dc69546",
"Milvus_connect_param": {
"host": "localhost",
"port": 19530,
"connect_timeout": 10
},
"collection_infos": [
{
"name": "*"
}
],
"state": "Running"
}
]
}
```
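
A list response like the one above can be reduced to a quick status overview. The following sketch assumes the response has exactly the shape shown (a `tasks` array whose items carry `task_id` and `state`); the function name `task_states` is hypothetical.

```python
import json


def task_states(list_response_text):
    """Map each task_id in a list response to its state."""
    payload = json.loads(list_response_text)
    return {task["task_id"]: task.get("state") for task in payload.get("tasks") or []}


# Sample trimmed from the list response shown above.
sample = """
{
  "tasks": [
    {
      "task_id": "4d458a58b0f74e85b842b1244dc69546",
      "collection_infos": [{"name": "*"}],
      "state": "Running"
    }
  ]
}
"""
print(task_states(sample))
```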
Binary file added doc/pic/milvus-cdc-components.png
Binary file added doc/pic/milvus-cdc-data.png
12 changes: 0 additions & 12 deletions server/configs/cdc.yaml
@@ -23,16 +23,4 @@ sourceConfig:
# authParams: token:xxx
# kafka:
# address: 127.0.0.1:9092
enableReverse: false
reverseMilvus:
host: localhost
port: 19530
username: root
password: 123456
enable_tls: false
ignore_partition: true
connect_timeout: 10
currentMilvus:
host: 127.0.0.1
port: 19530
maxNameLength: 256
