Commit 689345f — Merge pull request #562 from FederatedAI/dev-2.1.0 (v2.1.0)
2 parents 9b49feb + 816a84e

32 files changed: +522 −32 lines

RELEASE.md (+10)

## Release 2.1.0

### Major Features and Improvements
* Improved the display of output data.
* Enhanced the PyPI package: configuration files have been relocated to the user's home directory, and relative paths for uploading data are resolved against the user's home directory.
* Supported running FATE algorithms in Spark on YARN client mode.

### Bug-Fix
* Fixed an issue where failed tasks could not be retried.
* Fixed an issue where a job could not run when the requested task cores exceeded the system's total cores.

## Release 2.0.0

### Major Features and Improvements
* Adapted to new scalable and standardized federated DSL IR

doc/fate_access.md (+195)
# FATE 2.0 Version Interconnection Guide

## 1. FATE Flow Integration Guide
- Description: This section provides guidance on integrating heterogeneous scheduling platforms with FATE Flow, the FATE scheduling platform.
- Scenario: this side is the system to be integrated; the partner is a FATE site.

### 1.1 Interfaces
![api](./images/fate_flow_api.png)

FATE Flow interfaces are divided into 4 categories:
1. receiving requests from upper-level systems, such as submitting, stopping, and querying jobs;
2. receiving requests from the scheduling layer, such as starting and stopping tasks;
3. receiving requests from algorithm containers, such as task status and input reporting;
4. receiving requests from the platform layer and distributing them to the interfaces of the participating parties.

#### 1.1.1 api-1
Description: Since this interface integrates with the upper-level system and does not involve interaction between schedulers, it is optional and can be customized without constraints.
#### 1.1.2 api-2
Refer to the [interface](./../python/fate_flow/apps/partner/partner_app.py) implementation:
- `/v2/partner/job/create`: Create a job
- `/v2/partner/job/start`: Start a job
- `/v2/partner/job/status/update`: Update job status
- `/v2/partner/job/update`: Update job information (e.g., progress)
- `/v2/partner/job/resource/apply`: Apply for job resources
- `/v2/partner/job/resource/return`: Return job resources
- `/v2/partner/job/stop`: Stop a job
- `/v2/partner/task/resource/apply`: Apply for task resources
- `/v2/partner/task/resource/return`: Return task resources
- `/v2/partner/task/start`: Start a task
- `/v2/partner/task/collect`: Scheduler collects task status
- `/v2/partner/task/status/update`: Update task status
- `/v2/partner/task/stop`: Stop a task
- `/v2/partner/task/rerun`: Rerun a task
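As a sketch of how an integrating scheduler might address these partner interfaces, the snippet below composes request URLs and bodies; the endpoint paths come from the list above, while the payload fields and base URL are illustrative assumptions rather than FATE Flow's authoritative schema.

```python
# Illustrative partner-side client helper for api-2. Only the endpoint
# paths are taken from the documented list; everything else is assumed.
PARTNER_API = {
    "job_create": "/v2/partner/job/create",
    "job_stop": "/v2/partner/job/stop",
    "task_start": "/v2/partner/task/start",
}

def build_request(base_url, action, **payload):
    """Compose the full URL and JSON body for a partner-interface call."""
    if action not in PARTNER_API:
        raise ValueError(f"unknown partner action: {action}")
    return base_url.rstrip("/") + PARTNER_API[action], payload

# Usage: build (not send) a create-job request to a hypothetical partner.
url, body = build_request("http://127.0.0.1:9380", "job_create",
                          job_id="202402271112016150790", dag={})
# url == "http://127.0.0.1:9380/v2/partner/job/create"
```

An actual integration would POST `body` to `url` with its HTTP client of choice; the point here is only the URL layout of api-2.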
#### 1.1.3 api-3
Refer to the [interface](./../python/fate_flow/apps/worker/worker_app.py) implementation:
- `/v2/worker/task/status`: Report task status
- `/v2/worker/model/save`: Save a model
- `/v2/worker/model/download`: Download a model
- `/v2/worker/data/tracking/query`: Query data
- `/v2/worker/data/tracking/save`: Record data
- `/v2/worker/metric/save/<execution_id>`: Record metrics
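Note that the metric endpoint carries an `<execution_id>` path parameter filled in per task execution. A small sketch of composing these worker paths (the helper name is hypothetical; only the path templates come from the list above):

```python
# Hypothetical helper for api-3 worker endpoint paths; path templates
# are from the documented list, the function itself is illustrative.
WORKER_PATHS = {
    "status": "/v2/worker/task/status",
    "metric_save": "/v2/worker/metric/save/{execution_id}",
}

def worker_path(name, execution_id=None):
    """Return the api-3 path, substituting execution_id where required."""
    return WORKER_PATHS[name].format(execution_id=execution_id or "")

# worker_path("metric_save", "202402271112016150790_psi_0_0_host_9998")
# == "/v2/worker/metric/save/202402271112016150790_psi_0_0_host_9998"
```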
#### 1.1.4 api-4
Refer to the [interface](./../python/fate_flow/apps/scheduler/scheduler_app.py) implementation:
- `/v2/scheduler/job/create`: Create a job
- `/v2/scheduler/job/stop`: Stop a job
- `/v2/scheduler/task/report`: Task report (e.g., status)
- `/v2/scheduler/job/rerun`: Rerun a job
### 1.2 Scheduler
The scheduler consists of two parts: the scheduling logic and the scheduling interface. For interconnection in a heterogeneous scenario, a unified scheduling process and interface are indispensable. In the scenario described above, where FATE Flow acts as the scheduling party for other vendors, those vendors do not need to implement the scheduler themselves.
#### 1.2.1 Approach
The core of scheduling is the scheduling process, which defines the lifecycle of a job. In FATE 1.x, the scheduler was bound to the initiator, meaning the coordinated scheduling of a multi-party job was performed at the initiator. This has a disadvantage: if companies A, B, and C each need to initiate tasks, each of their scheduling layers must implement the scheduler with the same scheduling logic, so the cost of interconnection is high. In version 2.0, the initiator and scheduler logic in the scheduling module are decoupled, and the scheduler can be specified in the job configuration. In that case, as long as any one of A, B, or C implements the scheduler, or FATE is used directly as the scheduler, the other vendors only need to implement the scheduling client interface, greatly reducing the cost of interconnection.

![scheduler](./images/scheduler.png)
<p style="text-align:center;">P represents the scheduling client interface, S represents the scheduler interface</p>

To illustrate this scheduling mode with an example: suppose A wants to create a job with C, with FATE Flow as the scheduler. A first requests FATE Flow's S (create-job) interface. After receiving the request, FATE Flow obtains the participant information (A, C) from the job configuration and then dispatches the request to each participant's P (create-job) interface.
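The create-job fan-out just described can be sketched as follows; the transport callback and configuration shape are simplified assumptions, not FATE Flow's actual scheduler implementation.

```python
# Sketch of scheduler-side create-job dispatch: S receives the request,
# reads the participants from the job configuration, and calls each
# participant's P (create-job) interface. `send` is a stand-in transport.
def create_job(job_conf, send):
    results = []
    for party in job_conf["parties"]:
        results.append(send(party["party_id"], "/v2/partner/job/create", job_conf))
    return results

# Usage with a stub transport that records each dispatch:
sent = []
create_job({"parties": [{"party_id": "A"}, {"party_id": "C"}]},
           lambda pid, path, conf: sent.append((pid, path)))
# sent == [("A", "/v2/partner/job/create"), ("C", "/v2/partner/job/create")]
```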
#### 1.2.2 Scheduling Logic
The scheduling logic manages the lifecycle of jobs: when to start and stop jobs, when to start and stop tasks, DAG parsing, component runtime dependencies, and so on. Based on how task status is obtained, FATE Flow's scheduling process has two modes: callback and poll. In callback mode, the participants actively report task status to the scheduler; in poll mode, the scheduler pulls task status from the participants at regular intervals. The scheduling process diagrams for the two modes are as follows:

![schedule-for-callback](./images/schedule_for_callback.png)
<p style="text-align:center;">Callback Mode</p>

![schedule-for-poll](./images/schedule_for_poll.png)
<p style="text-align:center;">Poll Mode</p>
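Poll mode can be sketched as a loop around `/v2/partner/task/collect`; the state names, interval, and round bound below are illustrative assumptions, not FATE Flow's exact constants.

```python
# Sketch of poll-mode status collection: the scheduler repeatedly calls a
# participant's /v2/partner/task/collect until the task reaches a final state.
import time

FINAL_STATES = {"success", "failed", "canceled"}  # assumed state names

def poll_task(collect, interval=0.0, max_rounds=100):
    """`collect` stands in for one /v2/partner/task/collect round-trip."""
    for _ in range(max_rounds):
        status = collect()
        if status in FINAL_STATES:
            return status
        time.sleep(interval)
    return "timeout"

# Usage with a stubbed participant that finishes on the third poll:
states = iter(["running", "running", "success"])
result = poll_task(lambda: next(states))
# result == "success"
```

Callback mode inverts this flow: the participant pushes each status change to the scheduler instead of being polled.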
#### 1.2.3 Scheduling Interface
Responsible for receiving requests from the platform layer and distributing them to the [api-2](#api-2) interfaces of the various participants, e.g., creating and stopping jobs. See [api-4](#api-4) for the interfaces.
## 2 Algorithm Integration Guide
In previous versions of FATE, algorithms ran as local processes started by the scheduling service. This approach had shortcomings in scalability and made interconnection difficult. In version 2.0, algorithms run in "algorithm containers", and heterogeneous algorithm scheduling is implemented through a standardized mechanism for building and loading algorithm images.

![scheduler](./images/federationml_schedule.png)

### 2.1 FATE Algorithm Containerization Solution
- Pre-processing: input handling for data, models, algorithm parameters, etc.; calls the platform-layer interface [api-3](#api-3) to obtain the relevant dependencies.
- Component runtime: the algorithm component logic.
- Post-processing: output handling for the algorithm component; calls the platform-layer interface [api-3](#api-3) to upload the outputs to the platform.

![](./images/schedule_for_component.png)
### 2.2 Integration
#### 2.2.1 Algorithm Parameters
FATE Flow passes parameters to the algorithm container as an environment variable: the key is `CONFIG` and the value is a JSON string. The content is as follows:
```yaml
component: psi
computing_partitions: 8
conf:
  computing:
    metadata:
      computing_id: 202402271112016150790_psi_0_0_host_9998
      host: 127.0.0.1
      port: 4670
    type: standalone/eggroll/spark
  device:
    metadata: {}
    type: CPU
  federation:
    metadata:
      federation_id: 202402271112016150790_psi_0_0
      parties:
        local:
          partyid: '9998'
          role: host
        parties:
        - partyid: '9999'
          role: guest
        - partyid: '9998'
          role: host
      osx_config:
        host: 127.0.0.1
        port: 9370
    type: osx
  logger:
    config:
  storage: standalone/eggroll/hdfs
engine_run:
  cores: 4
input_artifacts:
  data:
    input_data:
      output_artifact_key: output_data
      output_artifact_type_alias: null
      parties:
      - party_id:
        - '9998'
        role: host
      producer_task: reader_0
  model: null
job_id: '202402271112016150790'
launcher_conf: {}
launcher_name: default
mlmd:
  metadata:
    api_version: v2
    host: 127.0.0.1
    port: 9380
    protocol: http
  type: flow
model_id: '202402271112016150790'
model_version: '0'
parameters: {}
party_id: '9998'
party_task_id: 202402271112016150790_psi_0_0_host_9998
provider_name: fate
role: host
stage: default
task_id: 202402271112016150790_psi_0
task_name: psi_0
task_version: '0'
```
Here are the key configurations:
- `component`: The name of the algorithm. When multiple algorithms are packaged in the same image, this parameter identifies which one to run.
- `conf.computing`: Configuration of the computing engine.
- `conf.federation`: Configuration of the communication engine.
- `conf.storage`: Configuration of the storage engine; standalone/eggroll and hdfs are supported.
- `mlmd`: The platform-layer interface used to record the algorithm's output; the interface is [api-3](#api-3).
- `input_artifacts`: Input dependencies, including data, models, etc.
- `parameters`: Algorithm parameters.

The entry point for starting the algorithm must be specified with `CMD` when building the image, and the algorithm should call the status-reporting interface in [api-3](#api-3) upon completion.
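A minimal sketch of such an entry point, assuming a `CONFIG` variable shaped like the example above; the helper names and the way status is reported are illustrative, not FATE's actual runtime code.

```python
# Illustrative algorithm-container entry point: parse the CONFIG
# environment variable, then derive the api-3 status-report URL from
# the mlmd section. Helper names here are hypothetical.
import json
import os

def load_config():
    """Parse the CONFIG environment variable set by FATE Flow."""
    return json.loads(os.environ["CONFIG"])

def status_url(conf):
    """Build the /v2/worker/task/status URL from the mlmd metadata."""
    m = conf["mlmd"]["metadata"]
    return f"{m['protocol']}://{m['host']}:{m['port']}/v2/worker/task/status"

# Simulate the environment FATE Flow would provide (abridged config):
os.environ["CONFIG"] = json.dumps({
    "component": "psi",
    "mlmd": {"metadata": {"protocol": "http", "host": "127.0.0.1", "port": 9380}},
})
conf = load_config()
# conf["component"] == "psi"
# status_url(conf) == "http://127.0.0.1:9380/v2/worker/task/status"
```

In a real image, this script would be the `CMD`, would dispatch to the component named in `conf["component"]`, and would POST the final status to the derived URL when done.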
#### 2.2.2 Registering Algorithm Image
```shell
flow provider register -c examples/provider/register_image.json
```
Where `register_image.json` looks like this:
```json
{
  "name": "fate",
  "device": "docker",
  "version": "2.1.0",
  "metadata": {
    "base_url": "",
    "image": "federatedai/fate:2.1.0"
  }
}
```
#### 2.2.3 Using Algorithm Image
After registration, you can specify the provider in the DAG of the job configuration to run this FATE algorithm image, as shown below:
```yaml
dag:
  conf:
    task:
      provider: fate:2.1.0@docker
```
Alternatively, you can specify this image for a specific algorithm only. For details, refer to the [provider guide](./provider_register.md).
