Hot reloading Python Operator #239

haixuanTao · 2023-03-28T13:45:41Z

This PR add the capability to hot-reload python when the files are changed. Hot-reloading means that when the user change the python operator file, it will immediately take effect within the dataflow without loosing the current state of the operator.

Purpose

Sometimes reloading a robotic environment ( both virtual and physical ) might be really time-consuming.
Being able to hot-reload an operator makes iteration on change a lot quicker.

Implementation

Schematic explanation of the implementation:

dora-cli listen for file changes and push Reload event if it notice file changes.
Then, there is a long chain of message passing as follows:
dora-cli -> dora-coordinator -> dora-daemon -> Custom Nodes API -> Runtime Nodes -> Operator -> Python Operator API
Then the Python Operator will attempt to reload the operator.
- This is possible using "importlib.reload" in python
- Tracking state using "operator.dict" that store all user defined values.

This is based on notify . The implementation is differentiation based.

Fail-safe mechanism:

Failing the initialization on reloading is going to abort the reloading and not crash the operator.
State-values stored in the before-reloading Operator will be forwarded to the reloaded operator.
If new state-value is been set in the initialization, they will be available in the reloaded operator.
If state-value has been changed in the initialization but was already used before, they will keep their before-reloading value ( used or not ). It is therefore not possible to change currently being used state-value stored in the Operator class.
You can use global variable if you wish to have 'changeable' variables.

Not included in this Pull Request

Custom Nodes Reloading
Not Python Operator Reloading

phil-opp

That's a great feature!

There are two aspects of the current implementation that I'm not sure about:

I'm not sure if the daemon is the right place for the file watching.
- Right now the daemon only runs on the local machine, but we plan to support deployments to remote deaemons too.
- The file watching should probably always happen on the development machine.
I don't think that we should enable live-reloads by default.
- It can be suprising that modifying a file affects a dataflow that was started earlier.

I think the typical approach to features like this is to provide some sort of explicit watch command. How about the following:

We add a --watch flag to the dora-cli start command. If this flag is given, the command does not once the dataflow is started, but keeps watching the dataflow YAML file and the referenced nodes/operators.
When a change is detected, the CLI prints an info message (or a warning is the dataflow is no longer valid) and sends a message to the coordinator.
- (Once we add remote deployments in the future, the message could also contain the new executable.)
The coordinator forwards the reload message to the daemon(s) that the dataflow runs on.

binaries/runtime/src/operator/mod.rs

phil-opp · 2023-03-28T14:42:55Z

binaries/runtime/src/operator/shared_lib.rs

+                IncomingEvent::Reload => {
+                    // TODO: Reload shared library.
+                    continue;
+                }


I don't think that we can safely reload a shared library. It's already quite difficult for Rust code and doing it across languages (C, C++, Rust, etc) is even more difficult. So I think we should never try to reload shared libraries and log a warning here.

quite difficult for Rust code

That's an excellent article. I naively thought that we could just swap context just like in Python. Thanks!

So, instead of reloading shared library, what about re-spawning a Custom node or Runtime, copying the previous state in shared memory and passing it to the newly spawned Custom node or Runtime. This should result in the same behavior as reloading. It might even add the benefit of smaller reloading interruption time.

The issue is that the state is controlled entirely by the operator. The operator gives us a opaque pointer on initialization and the runtime just passes this pointer on on_event calls. Only the operator itself knows what to do with it.

So we cannot just copy the memory referenced by the pointer since we don't know the size of the data. It might even contain pointers to other data internally that is part of the state too. Another challenge is that the new operator might use a different data format, so copying the old state in could result in invalid values and undefined behavior.

The only way to copy the state over is with cooperation between the operator, the runtime, and the new operator. We could for example introduce a serialize_state event that needs to be handled manually by the operator code. The operator needs to serialize all state into some common format (e.g. JSON) and the new operator needs to be able to deserialize and apply this state again. It also needs to be able to migrate the data from old structures in case the data format changed. Since this approach is quite complex and requires manual work, I think it's only suited for some planned updates (e.g. update a running operator from version 1 to version 1.1), but not for hot reloading...

Agreed, thanks for the detailed explanation! Let's see if people use this feature for python, and how they use it, and what state management is more beneficial for the user, that might guide us for the rest.

apis/rust/node/src/daemon_connection/event_stream.rs

haixuanTao · 2023-03-29T12:22:09Z

That's a great feature!

There are two aspects of the current implementation that I'm not sure about:

I'm not sure if the daemon is the right place for the file watching.

Right now the daemon only runs on the local machine, but we plan to support deployments to remote deaemons too.

The file watching should probably always happen on the development machine.

I don't think that we should enable live-reloads by default.

It can be suprising that modifying a file affects a dataflow that was started earlier.

I think the typical approach to features like this is to provide some sort of explicit watch command. How about the following:

We add a --watch flag to the dora-cli start command. If this flag is given, the command does not return once the dataflow is started, but keeps watching the dataflow YAML file and the referenced nodes/operators.

When a change is detected, the CLI prints an info message (or a warning is the dataflow is no longer valid) and sends a message to the coordinator.

(Once we add remote deployments in the future, the message could also contain the new executable.)

The coordinator forwards the reload message to the daemon(s) that the dataflow runs on.

I like the idea of the --watch flag! I forgot that we had indeed had plan to split the code and the deployment.

I'll check it out later.

haixuanTao · 2023-03-30T10:04:15Z

So I added to flags to dora start <DATAFLOW>:

--attach that makes it possible to start a dataflow and not return as long as the dataflow is running. dora-cli does that by periodically query the coordinator for the dataflow status. In case of ctrl+c, dora-cli will send a stop dataflow signal to the coordinator.
--hot-reload that makes it possible when attached to do python hot reloading. All files listed as python operators are going to be watch for change from the cli process and reloaded in case of changes.
In case you want to pause the hot reloading, you can use the Linux standard ctrl+z. That will pause dora-cli file watching ( not the dataflow ). You can then resume it with f+g.

example usage:

dora start dataflow.yml --attach --hot-reload

Note: I change --watch for --attach as it is the semantic of docker.

phil-opp

Looks good now, thanks a lot for your work on this!

binaries/cli/src/attach.rs

binaries/daemon/src/lib.rs

…s to be reloaded

…s, and removing `info!` on `cli` querying of `coordinator`

phil-opp reviewed Mar 28, 2023

View reviewed changes

haixuanTao force-pushed the hot-reloading branch from 45bb4b0 to c07e864 Compare March 30, 2023 09:49

haixuanTao changed the base branch from main to validate-yaml March 30, 2023 09:51

haixuanTao marked this pull request as ready for review March 30, 2023 10:15

haixuanTao force-pushed the hot-reloading branch 4 times, most recently from e6ba7fe to 61182b0 Compare March 30, 2023 11:33

Base automatically changed from validate-yaml to main March 31, 2023 09:49

phil-opp approved these changes Mar 31, 2023

View reviewed changes

binaries/cli/src/attach.rs Outdated Show resolved Hide resolved

binaries/daemon/src/lib.rs Outdated Show resolved Hide resolved

haixuanTao force-pushed the hot-reloading branch from 61182b0 to d19b74b Compare March 31, 2023 11:51

phil-opp approved these changes Mar 31, 2023

View reviewed changes

haixuanTao added 12 commits March 31, 2023 20:34

Add Reload event containing dataflow_id, node_id, op_id that need…

2d76cd5

…s to be reloaded

Add watcher to the daemon to push Reload event

12def83

Hot reload python operator when receiving Reload event

5e30fa6

Use cross-platform generic file watcher

f7941f1

Add Reload and Check enum

bd9c370

Remove daemon file watcher

a990b6f

forward reloading messages from the cli to the daemon

f0195d5

Add attach logic and hot-reloading logic to the cli

8f05ec7

Fix CI issue with not retrieving UUID

edf59d5

Address PR comments by removing unsued event, removing reloading todo…

2f6eb38

…s, and removing `info!` on `cli` querying of `coordinator`

Fix wrongly rebased python operator definition

835f7db

Add dora-cli name for tracing

fb694df

haixuanTao force-pushed the hot-reloading branch from 1b19ed2 to fb694df Compare March 31, 2023 12:38

haixuanTao enabled auto-merge March 31, 2023 12:39

Fix clippy and remove unwanted .unwrap

c2071e7

haixuanTao disabled auto-merge March 31, 2023 12:55

haixuanTao enabled auto-merge March 31, 2023 12:58

haixuanTao merged commit 4f4dfab into main Mar 31, 2023

haixuanTao deleted the hot-reloading branch March 31, 2023 13:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hot reloading Python Operator #239

Hot reloading Python Operator #239

haixuanTao commented Mar 28, 2023 •

edited

Loading

phil-opp left a comment

phil-opp Mar 28, 2023

haixuanTao Mar 29, 2023 •

edited

Loading

phil-opp Mar 31, 2023

haixuanTao Mar 31, 2023 •

edited

Loading

haixuanTao commented Mar 29, 2023

haixuanTao commented Mar 30, 2023

phil-opp left a comment

Hot reloading Python Operator #239

Hot reloading Python Operator #239

Conversation

haixuanTao commented Mar 28, 2023 • edited Loading

Purpose

Implementation

Fail-safe mechanism:

Not included in this Pull Request

phil-opp left a comment

Choose a reason for hiding this comment

phil-opp Mar 28, 2023

Choose a reason for hiding this comment

haixuanTao Mar 29, 2023 • edited Loading

Choose a reason for hiding this comment

phil-opp Mar 31, 2023

Choose a reason for hiding this comment

haixuanTao Mar 31, 2023 • edited Loading

Choose a reason for hiding this comment

haixuanTao commented Mar 29, 2023

haixuanTao commented Mar 30, 2023

phil-opp left a comment

Choose a reason for hiding this comment

haixuanTao commented Mar 28, 2023 •

edited

Loading

haixuanTao Mar 29, 2023 •

edited

Loading

haixuanTao Mar 31, 2023 •

edited

Loading