Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTA redesign discussion (once again) #346

Closed
marvinroger opened this issue Jun 1, 2017 · 32 comments
Closed

OTA redesign discussion (once again) #346

marvinroger opened this issue Jun 1, 2017 · 32 comments

Comments

@marvinroger
Copy link
Member

There are 2 OTA proposals, so I guess there's some interest in changing the OTA design. As discussed with some of you, MQTT is not meant to handle RPC style communication. The current implementation does RPC.

Please understand that the v2.0.0 is not yet final, so it's "legit" to change the way OTA works before final release.

Here are the choices:

1. As @brodtm proposed (simplified a bit, see #344):

  • The OTA entity stores the firmware retained to a global homie-esp8266 "bucket" topic. Say homie/$esp8266_firmwares/<fw checksum>
  • The OTA entity sends retained to the device $implementation/ota/target its target firmware checksum
  • The ESP8266 checks that the target checksum is different than the current one; if so, subscribe to homie/$esp8266_firmwares/<fw checksum> and flash the firmware

Benefits

@brodtm: The advantage of this approach is that the mqtt broker does not have to store a copy of the OTA binary file (about 300-400KB) for each device that requires an update, it only needs to hold a single copy.

Drawbacks

  • As firmwares are sent as retained, firmwares (~350kB) are stored on the broker.
  • It takes two messages to actually flash a firmware (three if we want to cleanup the firmware, in case we want a punctual OTA for a single device)

2. As @boneskull proposed (modified a bit, see #325):

  • The OTA entity sends retained to the device $implementation/ota/firmware/<fw checksum> its target firmware checksum
  • The ESP8266 checks that the checksum is different than the current on; if so, flash the firmware that is actually the payload of the message

Benefits

One message to flash a firmware (two if we want to cleanup the firmware after the OTA)

Drawbacks

  • As firmwares are sent as retained, firmwares (~350kB) are stored on the broker for each device, even if the firmware if the same

3. Currently (simplified):

  • The OTA entity publishes as retained to the device the latest available firmware MD5 to $implementation/ota/checksum
  • If checksum is not the same, the ESP8266 asks the OTA entity to publish the firmware
  • The OTA entity publishes the firmware to $implementation/ota/firmware, and the firmware updates

Benefits

No need to store the firmware beforehand on the broker.

Drawbacks

  • We do RPC
  • It takes two messages to actually flash a firmware

Comparison

1 (@brodtm) 2 (@boneskull) Current
Does not do RPC 🔴
Does not pollute broker 🔴 🔴
Simplicity (out of 3) 2 3 1

Thinking about it, I think I still prefer the current approach... But let me know what you think. You can of course propose something else.

If you don't want to be involved in the discussion (even if you do actually), please vote below:

@marvinroger
Copy link
Member Author

I vote 1 (@brodtm)

@marvinroger
Copy link
Member Author

I vote 2 (@boneskull)

@marvinroger
Copy link
Member Author

I vote current

@boneskull
Copy link

I vote for my own 🙂

Is firmware stored on a broker (or traveling multiple times over a LAN) really a concern? Because if the target is home automation, users are unlikely to have 100s or 1000s of devices to flash. Also, if we don't set the retain bit (and use, say, QoS 2 instead), it won't be stored any longer than it takes to send an ACK, right?

@boneskull
Copy link

my use case: I was trying to automate OTA updates with Node-RED, which did not work well due to the RPC.

@marvinroger
Copy link
Member Author

@boneskull Could you please use the +1 feature, it'll be easier to track. 😇

But without retained message, deep sleep devices won't be able to accept OTA. With retained messages, moreover, the device will receive ~350kB of useless data on each boot. Which is a lot considering the bandwidth of the ESP8266. Two solutions:

  • The ESP8266 can clean the firmware itself after a successful OTA is done
  • The OTA entity can clean the firmware after a successful OTA is done

That might sound like a good compromise, and I prefer the ESP8266 cleans the firmware approach. What about you?

@boneskull
Copy link

Yes, good idea. Agree.

(Mobile interface doesn't have "reaction" functionality)

@bertmelis
Copy link
Contributor

bertmelis commented Jun 1, 2017

I was just thinking the same on gitter!
you also won't accidentaly downgrade if you happen to flash via serial in between without clearing the retained firmware.

@boneskull
Copy link

But--shouldn't QoS 2 work? It guarantees delivery so it'll hang on to the firmware until it's delivered regardless of whether the device is online. It won't resend that data once it's been received.

@bertmelis
Copy link
Contributor

Doesn't the ESP disconnect properly on deep sleep and hence unsubscribes? In that case it won't receive the update.

@boneskull
Copy link

@bertmelis depends if 'clean session' is set or not, I think

@marvinroger
Copy link
Member Author

@boneskull are you sure about that? In QoS 2, the broker keeps the message until it is received once? Even if not retained?

I will try, but if it is the case that would be great!

@bertmelis It subscribes to the OTA topic on setup so maybe it would receive the OTA packet before the user code has a chance to enter into deep sleep... This has to be tested, but if not possible we'll explicitely state in the docs that calling deep sleep alters OTA functionality.

@boneskull
Copy link

boneskull commented Jun 1, 2017 via email

@bertmelis
Copy link
Contributor

@boneskull true, on persistent connections (clean session flag to false), the broker keeps the messages untill delivered once as per OASIS standard. Could it interfere with the normal way of work?

@exxamalte
Copy link

I voted for approach 2 (@boneskull) assuming that some sort of clean-up is happening after flashing the device. I don't have concerns with the payload being (temporarily) stored on the broker, because I can't think of a situation where someone has more than say 3-5 devices of the exact same type at home which would accept the same firmware.
Compatibility with deep sleep is a must-have for me.

@marvinroger marvinroger added this to the v2.0.0-beta.2 milestone Jun 2, 2017
@marvinroger
Copy link
Member Author

Alright, indeed the QoS 2 approach would work great here. One message to flash a firmware (no need to clean it as it will be delivered only once). Sounds great!

@boneskull
Copy link

@bertmelis I think of one way it could cause an issue. If we keep a persistent session across firmwares, then we will potentially have stale subscriptions hanging around.

So, if we can ensure the persistent session is not retained after a firmware upgrade, we should be fine. I'm not sure of the best way to do this, but a dynamic client ID could be one of them.

An API should be exposed to reset the session (clean session after next restart or disconnect & reconnect w/ new session, then unsubscribe and re subscribe to any topics) in case of unforeseen issues.

@boneskull
Copy link

Keep in mind that the ESP8266 must subscribe ar QoS 2 for this to work and the publishing client must do the same.

@marvinroger
Copy link
Member Author

Which stale subs? Do you have an example?

There's also a problem with the AsyncMqttClient, it is not spec compliant when session is enabled (see the docs of the client, in the limitations page).
Maybe we should use the retained flag that we clean after instead of session, the result is the same anyway.

@boneskull
Copy link

if esp8266 w firmware 1.0 subscribes to topic A then you upgrade to firmware 2.0 which subscribes to topic B instead, it will still be subscribed to A if the session is persisted

@boneskull
Copy link

(subscribed to both)

@bertmelis
Copy link
Contributor

So it'll have to unsubscribe just before disconnecting. Don't we have to unsubscribe to all topics if we use a persistent connection? Otherwise a lot of stale subscriptions are potentially left behind.

@joelrader
Copy link

joelrader commented Jun 3, 2017

I'm still learning about firmware updates and how they're done today, but I've also been using Homie-OTA. I really like the simplicity of the web interface, but ultimately want to get things automated via Node-RED. Not sure which of the three approaches listed here would be best for that.

Thinking out loud, ideally I'd like a way to control firmware updates as either push or pull. I could see a set of devices that I'd want to have automatically check for updated firmware and update themselves (aka automatic pull). There could be an option to have an adjustable update delay - that way if you had many devices they all wouldn't update at once. Some type of random prime number delay or a broadcast flag that tells the other devices to wait while a node is updating. A manual pull would require some type of interaction with a device to engage an update. For example, a critical device may not update until a scheduled maintenance window.

The other option, a push, would be triggered by the server. This would effectively force devices to update. The use case I'm thinking of would be for a security update - you'd want all the devices to update ASAP.

After writing this, I imagine much of this could be coded into the device firmware, or would have to be since there's no "server" to speak of. Unless you're using a setup like Homie-OTA or maybe Node-RED.

Really enjoying the Homie project! I've been using Particle devices, which are great, but I appreciate the Homie approach where everything is "in-house".

@marvinroger
Copy link
Member Author

@boneskull @bertmelis to avoid this added complexity of stale subscriptions, and the fact that persistent sessions are not spec compliant on the async-mqtt-client (and will never be as we would have to store entire messages in memory...), let's go with the retained approach?

@Raiderj the most straightforward way is the one we're going to implement (@boneskull's one).
I am not in favor of the "push and pull" approach. Polling would require some sort of RPC, and that's precisely what we're trying to avoid. I think another entity (server-side) should handle the OTA scheduling, and the devices should be dumb as it can, so they have a lighter workload to achieve.

@bertmelis
Copy link
Contributor

I'll forward to it! Shall we have the ESP publish an empty message to clear the retained firmware after flashing?

@marvinroger
Copy link
Member Author

@bertmelis it is already implemented on my side, and I refactored a lot of code so it is now cleaner. I'll soon push. 😉

@jamesmyatt
Copy link
Contributor

I haven't been following this discussion in detail but I would be careful about any approach that depends on QOS 2 since some mqtt brokers and clients don't implement it.

@luebbe
Copy link
Collaborator

luebbe commented Jun 12, 2017

Chiming in with a different subject concerning OTA. I'd like to hook into the OTA progress, like it is possible with ArduinoOTA.onProgress(...), to show the progress on an attached OLED display.
Would that be possible somehow?

@marvinroger
Copy link
Member Author

@nzbuu I've implemented the QoS1 retained approach, so no problem here.

@luebbe definitely! I'll add an OTA_PROGRESS event on which you'll be able to get the progress. 😉

@bertmelis
Copy link
Contributor

A remark tough, as this is a breaking change, you need to upgrade your nodes with care, as you're unable to distinguish the different OTA-procedures based on the Homie-version.

@marvinroger
Copy link
Member Author

marvinroger commented Jun 12, 2017 via email

@luebbe
Copy link
Collaborator

luebbe commented Jun 12, 2017

@marvinroger That sounds nice!
Currently I have ArduinoOTA compiled in and Home OTA active, so I can do both 8-)

euphi added a commit to euphi/homie-esp8266 that referenced this issue Mar 31, 2018
* 🔥 Remove hardcoded keepalive - fix homieiot#301 (homieiot#314)

Remove hardcoded KeepAlive for MQTT connection.
Default value of 15sec is already present in AsyncMQTT library

* 🐎 Improve uptime accuracy (homieiot#315)

improve uptime accuracy by storing milliseconds and only rounding when
publishing value
drawback: less time before rollover, but still long enough

* 🐛 Fix truncated IP (homieiot#318)

published IP is truncated.

* 🎨 Fix warning with parenthesis

* 🐛 Fix pio library.json bad dependency name

* ✨ Add support for static IP and BSSID/MAC, channel of AP (homieiot#327)

* Support for static IP and BSSID/MAC, Channel of AP

- added parameters to config.json which allow to define static ip, mask,
gateway, dns
- added parameters to config.json which allow to define BSSID and
channel of AP

To run device with defined static IP you have to define ip, mask and
gateway together.

To point device to connect to specific BSSID and channel you haveto
define bssid and channel together.

```
{

	"name": "The kitchen light",
	"device_id": "kitchen-light",
	"wifi": {
		"ssid": "Network_1",
		"password": "I'm a Wi-Fi password!",
		"bssid":
		"DE:AD:BE:EF:BA:BE",
		"channel": 1,
		"ip": "192.168.1.5",
		"mask": "255.255.255.0",
		"gw": "192.168.1.1"
	},
	"mqtt": {
		"host": "192.168.1.10",
		"port": 1883,
		"base_topic": "devices/",
		"auth": true,
		"username": "user",
		"password": "pass" i
	},
	"ota": {
		"enabled": true
	},
	"settings": {
		"param1": 55,
		"param2": "abcdefghijklm",
		"param3": true,
		"param4": false,
		"param5": 2147483647,
		"param6": -2147483647,
		"param7": 55,
		"param8": "abcdefghijklm",
		"param9": true,
		"param10": false
	}
}
```

* Addjustments for travis

* Another addjustments for travis

* Fix problem with prepareToSleep

* 🐎 Pass all callbacks by reference
Still store it by value

* 🐎 Make isActive const

* 🎨 Add custom settings value to initial log

* ✨ Abort if default setting value does not pass validator function
Fix homieiot#324

* 📝 Clarify ISSUE_TEMPLATE docs location for homieiot#331

* 📝 Implement new versioned Git docs (homieiot#341)

* ✨ Add versionned in-repo docs

* 🐛 Attempt to fix encrypted key

* 🐛 Fix permission issue on python exec

* 🐛 Add mission import

* 🐛 Fix relative path

* 🐛 Add missing commit

* ✨ Add index

* 🎨 Change index design

* 📝 Update all URLs to new docs

* 🎨 Fix broken doc link in README

* 📝 Add edit link to docs

* 📝 Move firmware_parser.py to scripts folder

* 🐛 Fix out of limits abort message not showing

* 👕 Be less strict on whitespace in comments

* 📝 Add dummy OTA updater script (TODO)

* 🎨 🐎 Use CircleCi instead of Travis CI (homieiot#348)

* ✨ Add Circle CI build

* 🐛 Add working_directory

* 🐛 Attempt to fix perm issue

* ✨ Generate docs from CircleCI

* 🐛 Fix path

* 🐛 Fix chmod permission

* 🐛 Try to run sudo

* 🐛 Attempt to use remote docker

* 🐛 Change bad directories

* 🐛 Remove dependency on Docker

* 🐛 Fix perm problem

* 🐛 Chamge tmp folder

* 🐎 Use CircleCI instead of Travis CI

* 🎨 Add SonoffDualShutters example

* 🐛 Install current lib to platformio

* 🐛 Actually ignore gh-pages in CI

* 🐛 Add missing SonoffDual dep

* 🐛 Fix ArduinoJSON 5.11.0 (homieiot#363)

* ⬆️ Update pio dependencies

* 🎨 Cleanup code a bit

* 🐛 Make sure every announcements packet are sent - closes homieiot#345

* 🎨 Refactor code and implement new OTA system
Closes homieiot#346

* 👕 Fix lint

* ✅ Use new workflow feature from Circle

* ✅ Ignore gh-pages at workflow level

* ✨ Add OTA_PROGRESS event

* 📝 Update docs for new OTA system

* ⬆️ Upgrade AsyncMqttClient dependency to 0.8.0

* 🐛 IHomieSetting::settings first, HomieNode-settings second! (homieiot#335)

*  IHomieSetting::settings first, HomieNode-settings second!

Set initialization priority of IHomieSetting::settings to value of highest allowed priority (101).

This allows other static variables to be of type HomieSetting<T>.

See https://github.com/euphi/HomieNodeCollection/blob/master/src/RGBWNode.cpp for example.

Note: As shown in the example it makes sense to have a static HomieSetting member, if you have a class that may be instantiated multiple times. (e.g. two LED Strips connected to the same ESP8266, as shown in https://github.com/euphi/ESP-LEDCtrl).

* Removed extra whitespaces

* Removed another whitespace

* 🐛 Fix not returning a value in setConfigurationApPassword (homieiot#378)

`HomieClass& HomieClass::setConfigurationApPassword(const char* password)` did not returned a reference to the Homie instance. This PR fixes this.

* 🐛 Fix crash when starting up without any defined node (homieiot#379)

* Fix crash when starting up without any defined node

Skip node publication if HomieNode::nodes.size() == 0

* Update BootNormal.cpp

* 🐛 Fix topic check for OTA upload (homieiot#375)

* Change topic check for OTA upload

* Change firmware topics to remove the 's'

* 🐛 Rename last OTA topic instance

* ✨ Add OTA updater script (homieiot#384)

* 🐍Add python ota updater script

* 💼 Update documentation of ota update script

* 😑 Add comments to ota updater script

* 🔮 Use 127.0.0.1:1883 as default broker setting

For the ota updater script

* 📝 Add details on how to interact with range property (homieiot#393)

Add hint to help people figure out how to interact with
range properties. Especially document the `_` separator.

* 📝 Add warning to input-handlers.md about concurrency (homieiot#400)

* 🐛 Interpret firmware file as an bytearray (homieiot#403)

Fix homieiot#397

* 📝 Update input-handlers.md (homieiot#401)

* Update input-handlers.md

Sorry, it seems that I used the `!!! warning` block in a wrong way.

* Update input-handlers.md

* Update input-handlers.md

* ✨ Use AsyncWebServer + Refactoring (homieiot#425)

* Initial AsyncWebServer

* Fixed Proccessing Body Requests (JSON)

* Doc Fixes + typo

* Added Missing Method in Timer.cpp

* Lots of Refactoring + Moved Reset Button to its own helper class for boots to use

* 🎨 Update BootNormal.cpp (homieiot#426)

Solve "else" errors from https://circleci.com/gh/marvinroger/homie-esp8266/136#tests/containers/0

* ⬆️ Update dependency to "ESP Async WebServer" (homieiot#434)

* 📝 Add instruction for @platformio (homieiot#435)

* Instruction for @platformio

* Explain how to used tagged version with @platformio

* Explain how to use tagged version with @platformio

* 🐛 Use v2.0.0-beta.2 as a working tagged example for @platformio (homieiot#437)

* 🎨 Simplify CI with @platformio (homieiot#438)

* 🐛 Install library via @platformio with all dependencies (homieiot#439)

* 💚 CI: Install staging version of Arduino Core for ESP8266 & @platformio (homieiot#440)

* 🐛 Pin Shutters dep version

* 🔥 Don't fail on CI docs step when testing a fork

* 🎨 More Refactoring + Deep Sleep + Prevent WiFi Reconnect when reboot (homieiot#432)

* Initial AsyncWebServer

* Fixed Proccessing Body Requests (JSON)

* Doc Fixes + typo

* Added Missing Method in Timer.cpp

* Lots of Refactoring + Moved Reset Button to its own helper class for boots to use

* Lots of Refactoring.

* Refactored ResetHandler + More Refactoring + Testing

* Added deep sleep function + Prevent Wifi reconnect before reboot (homieiot#380)[homieiot#380]

* Small typo fixes

* Small Rearange of code in BootConfig + Minor Refactor of function names

* Minior Commit to Triger a Github Action

* circleci build fix

* 🎨 First pass to fix linting

* 🎨 Second pass of lint

* 🎨 Final lint fix

* 📝 Update docs deps

* 📝 🎨 Update docs manifest

* 📝 🎨 Adjust HTTP JSON API doc

* 📝 Add configurators on website

* 📝 Update links to configurator

* Fix links under Features (homieiot#452)

* Set Device Stats Interval (homieiot#451) (homieiot#455)

* Add last step to uibundle README (homieiot#460)

* Add last step to uibundle README

* Added Arduino Support for doc

* Update updater script addressing quirks (homieiot#461)

* Update API for /wifi/connect from GET to PUT (homieiot#468)

Docs show `/wifi/connect` as `GET` when it should actually be `PUT`

* Proposal to optionally run HomieNode::loop() also in disconnected state

* Show Homie version

* Fix homieiot#446 homieiot#477 (homieiot#501)

* Fix homieiot#446 CORS Issue

* Fix for homieiot#477

* Fix Lint

* Fix Safari not displaying the config bundle HTML page (Fix homieiot#476) (homieiot#502)

* Fix Safari not displaying the config bundle HTML page

Safari cannot deal with gzip files that have a "*.gz" file extension.
Simply faking the filename solves the problem though.

* Update Readme Homie Version

* Fix Warnings (homieiot#503)

* Update Readme - Homie Convention

* Update to Homie Convention v2.0.1 (homieiot#507)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants