Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hot reloading Python Operator #239

Merged
merged 13 commits into from
Mar 31, 2023
Merged

Hot reloading Python Operator #239

merged 13 commits into from
Mar 31, 2023

Conversation

haixuanTao
Copy link
Collaborator

@haixuanTao haixuanTao commented Mar 28, 2023

This PR add the capability to hot-reload python when the files are changed. Hot-reloading means that when the user change the python operator file, it will immediately take effect within the dataflow without loosing the current state of the operator.

Purpose

Sometimes reloading a robotic environment ( both virtual and physical ) might be really time-consuming.
Being able to hot-reload an operator makes iteration on change a lot quicker.

Implementation

Schematic explanation of the implementation:

  • dora-cli listen for file changes and push Reload event if it notice file changes.
  • Then, there is a long chain of message passing as follows:
    dora-cli -> dora-coordinator -> dora-daemon -> Custom Nodes API -> Runtime Nodes -> Operator -> Python Operator API
  • Then the Python Operator will attempt to reload the operator.
    • This is possible using "importlib.reload" in python
    • Tracking state using "operator.dict" that store all user defined values.

This is based on notify . The implementation is differentiation based.

Fail-safe mechanism:

  • Failing the initialization on reloading is going to abort the reloading and not crash the operator.
  • State-values stored in the before-reloading Operator will be forwarded to the reloaded operator.
  • If new state-value is been set in the initialization, they will be available in the reloaded operator.
  • If state-value has been changed in the initialization but was already used before, they will keep their before-reloading value ( used or not ). It is therefore not possible to change currently being used state-value stored in the Operator class.
  • You can use global variable if you wish to have 'changeable' variables.

Not included in this Pull Request

  • Custom Nodes Reloading
  • Not Python Operator Reloading

Copy link
Collaborator

@phil-opp phil-opp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great feature!

There are two aspects of the current implementation that I'm not sure about:

  • I'm not sure if the daemon is the right place for the file watching.
    • Right now the daemon only runs on the local machine, but we plan to support deployments to remote deaemons too.
    • The file watching should probably always happen on the development machine.
  • I don't think that we should enable live-reloads by default.
    • It can be suprising that modifying a file affects a dataflow that was started earlier.

I think the typical approach to features like this is to provide some sort of explicit watch command. How about the following:

  • We add a --watch flag to the dora-cli start command. If this flag is given, the command does not once the dataflow is started, but keeps watching the dataflow YAML file and the referenced nodes/operators.
  • When a change is detected, the CLI prints an info message (or a warning is the dataflow is no longer valid) and sends a message to the coordinator.
    • (Once we add remote deployments in the future, the message could also contain the new executable.)
  • The coordinator forwards the reload message to the daemon(s) that the dataflow runs on.

binaries/runtime/src/operator/mod.rs Show resolved Hide resolved
Comment on lines 206 to 204
IncomingEvent::Reload => {
// TODO: Reload shared library.
continue;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we can safely reload a shared library. It's already quite difficult for Rust code and doing it across languages (C, C++, Rust, etc) is even more difficult. So I think we should never try to reload shared libraries and log a warning here.

Copy link
Collaborator Author

@haixuanTao haixuanTao Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quite difficult for Rust code

That's an excellent article. I naively thought that we could just swap context just like in Python. Thanks!

So, instead of reloading shared library, what about re-spawning a Custom node or Runtime, copying the previous state in shared memory and passing it to the newly spawned Custom node or Runtime. This should result in the same behavior as reloading. It might even add the benefit of smaller reloading interruption time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that the state is controlled entirely by the operator. The operator gives us a opaque pointer on initialization and the runtime just passes this pointer on on_event calls. Only the operator itself knows what to do with it.

So we cannot just copy the memory referenced by the pointer since we don't know the size of the data. It might even contain pointers to other data internally that is part of the state too. Another challenge is that the new operator might use a different data format, so copying the old state in could result in invalid values and undefined behavior.

The only way to copy the state over is with cooperation between the operator, the runtime, and the new operator. We could for example introduce a serialize_state event that needs to be handled manually by the operator code. The operator needs to serialize all state into some common format (e.g. JSON) and the new operator needs to be able to deserialize and apply this state again. It also needs to be able to migrate the data from old structures in case the data format changed. Since this approach is quite complex and requires manual work, I think it's only suited for some planned updates (e.g. update a running operator from version 1 to version 1.1), but not for hot reloading...

Copy link
Collaborator Author

@haixuanTao haixuanTao Mar 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, thanks for the detailed explanation! Let's see if people use this feature for python, and how they use it, and what state management is more beneficial for the user, that might guide us for the rest.

apis/rust/node/src/daemon_connection/event_stream.rs Outdated Show resolved Hide resolved
@haixuanTao
Copy link
Collaborator Author

That's a great feature!

There are two aspects of the current implementation that I'm not sure about:

  • I'm not sure if the daemon is the right place for the file watching.

    • Right now the daemon only runs on the local machine, but we plan to support deployments to remote deaemons too.
    • The file watching should probably always happen on the development machine.
  • I don't think that we should enable live-reloads by default.

    • It can be suprising that modifying a file affects a dataflow that was started earlier.

I think the typical approach to features like this is to provide some sort of explicit watch command. How about the following:

  • We add a --watch flag to the dora-cli start command. If this flag is given, the command does not return once the dataflow is started, but keeps watching the dataflow YAML file and the referenced nodes/operators.

  • When a change is detected, the CLI prints an info message (or a warning is the dataflow is no longer valid) and sends a message to the coordinator.

    • (Once we add remote deployments in the future, the message could also contain the new executable.)
  • The coordinator forwards the reload message to the daemon(s) that the dataflow runs on.

I like the idea of the --watch flag! I forgot that we had indeed had plan to split the code and the deployment.

I'll check it out later.

@haixuanTao haixuanTao changed the base branch from main to validate-yaml March 30, 2023 09:51
@haixuanTao
Copy link
Collaborator Author

So I added to flags to dora start <DATAFLOW>:

  • --attach that makes it possible to start a dataflow and not return as long as the dataflow is running. dora-cli does that by periodically query the coordinator for the dataflow status. In case of ctrl+c, dora-cli will send a stop dataflow signal to the coordinator.

  • --hot-reload that makes it possible when attached to do python hot reloading. All files listed as python operators are going to be watch for change from the cli process and reloaded in case of changes.

  • In case you want to pause the hot reloading, you can use the Linux standard ctrl+z. That will pause dora-cli file watching ( not the dataflow ). You can then resume it with f+g.

example usage:

dora start dataflow.yml --attach --hot-reload

Note: I change --watch for --attach as it is the semantic of docker.

@haixuanTao haixuanTao marked this pull request as ready for review March 30, 2023 10:15
@haixuanTao haixuanTao force-pushed the hot-reloading branch 4 times, most recently from e6ba7fe to 61182b0 Compare March 30, 2023 11:33
Base automatically changed from validate-yaml to main March 31, 2023 09:49
Copy link
Collaborator

@phil-opp phil-opp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now, thanks a lot for your work on this!

binaries/cli/src/attach.rs Outdated Show resolved Hide resolved
binaries/daemon/src/lib.rs Outdated Show resolved Hide resolved
@haixuanTao haixuanTao merged commit 4f4dfab into main Mar 31, 2023
@haixuanTao haixuanTao deleted the hot-reloading branch March 31, 2023 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants