Cleanup branch: PMON #37

mslaffin · 2024-12-15T20:56:51Z

This PR is for miscellaneous cleanup and changes related to the ProcessMonitorSubsystem.


mslaffin · 2024-12-17T00:03:52Z

This was tested and confirmed to be working on 12/16.

Video recording:
Recording 2024-12-16 174706.zip

Log file:
log_2024-12-16_17-32-58.txt

There's still a bug where one or more units briefly fail to respond in time, despite increasing the fault tolerance at various levels (_poll_single_unit, poll_all_units and the subsystem's update_temperatures). This presents itself as a brief flash of the grey "Disconnected" error bar before recovering back to the nominal colored bar.

I do not think this is a pressing problem though, based on how quickly it recovers from the lack of response. The problem doesn't seem to be isolated to any particular unit. My hunch is that we're pushing an aggressive polling cycle that involves multiple reads (status, process_value) every time, and occasionally the DP16 units are polled in an unprepared state. Like I said, though, I'm not too concerned about this as an immediate issue.


mslaffin · 2024-12-23T16:29:46Z

I mentioned this in our meeting on 12/20, but I think there's a problem with how the modbus_lock is implemented inside poll_all_units that could be leading to slower/less predictable update cycles.

The temperatures are updating slower than they should. This behavior can be seen in the video I posted in PR#34.

The modbus_lock is held for ALL units, which is probably an inefficient way to do this. It currently looks like:

with self.modbus_lock:  # Lock held for all units
    for unit in self.unit_numbers:
        self._poll_single_unit(unit)  # Each unit can take up to 0.5s before timing out
time.sleep(0.1)  # Single sleep at end

The get_all_temperatures in the subsystem class returns a copy of the cached readings under response_lock, so we're always updating the GUI every 500ms, but not necessarily with new values until all units have been polled or timed out.

I'm going to pursue some changes to acquire and release modbus_lock between each unit, and try to track the actual time spent polling so that we're not making unnecessarily long time.sleep() calls.
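A minimal sketch of the per-unit locking idea described above, not the actual driver code. PollerSketch and read_unit are hypothetical stand-ins, and the 0.5 s value for BASE_DELAY is an assumption taken from the 500 ms base delay mentioned in this thread. The lock is acquired and released once per unit, and the time spent polling is subtracted from the sleep so the cycle does not run longer than it needs to.

```python
import threading
import time

BASE_DELAY = 0.5  # assumed target cycle time in seconds


class PollerSketch:
    """Hypothetical stand-in for the driver, for illustration only."""

    def __init__(self, unit_numbers, read_unit):
        self.unit_numbers = list(unit_numbers)
        self.modbus_lock = threading.Lock()
        self._read_unit = read_unit  # callable standing in for the Modbus read
        self.readings = {}

    def _poll_single_unit(self, unit):
        self.readings[unit] = self._read_unit(unit)

    def poll_all_units(self):
        """Poll every unit, locking per unit; return the remaining sleep time."""
        start = time.monotonic()
        for unit in self.unit_numbers:
            with self.modbus_lock:  # lock held for one unit only
                self._poll_single_unit(unit)
            # Lock is free here, so other Modbus users can run between units.
        elapsed = time.monotonic() - start
        # Sleep only for whatever is left of the cycle, never a negative time.
        return max(0.0, BASE_DELAY - elapsed)
```

A caller would then do something like time.sleep(poller.poll_all_units()) at the end of each cycle. If a unit times out and the polling pass already exceeds BASE_DELAY, the returned sleep is zero.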


mslaffin · 2025-01-20T22:14:31Z

Ian C. ran several tests on 1/20 which confirmed that these recent changes mask the spurious Modbus disconnection issue that we've been dealing with.

The consequence of this is a 2.5-minute delay after disconnecting the PMON system, during which the GUI retains the last known good readings before accurately reflecting the disconnected state. This is secondary to the main utility of the PMON subsystem and can be improved in a follow-up PR.

Attached is Ian's log file illustrating clean temperature reads and the handling of several disconnection events.
log_2025-01-20_13-53-26.txt

bwalkerMIR · 2025-01-20T23:21:36Z

Nice! Why is there a 2.5 min delay after disconnecting? Is that unique to unplugging the unit vs getting enough errors? That is, under what condition will the bars turn grey again? 2.5 min for showing disconnection is a long time.

mslaffin · 2025-01-20T23:51:39Z

The 2.5 minute delay occurs after unplugging because this driver shows the last good reading until we receive enough errors to exceed MAX_ERROR_THRESHOLD. This threshold is set at 30, so it requires 30 failed poll attempts before it shows the grey disconnected state.

With the base 500 ms delay, this would originally have been 30 × 500 ms = 15 seconds, but the delay is modified by an exponential backoff method (adjust_update_interval) in ProcessMonitorSubsystem, so the polling interval eventually increases to 5 seconds when we're only receiving errors. At that 5-second interval, 30 × 5 seconds = 150 seconds = 2.5 minutes.

Is 2.5 minutes too long? We can modify this backoff to make it time out faster too. I'm a bit concerned this could be a trade-off situation with the Modbus runtime errors though.
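The arithmetic above can be checked with a short sketch. The function name, the doubling factor, and the choice to count one wait per failed poll are assumptions for illustration; the actual behavior of adjust_update_interval is not shown in this thread. Only the threshold of 30, the 500 ms base, and the 5 second ceiling come from the comment.

```python
def time_to_threshold(threshold, base_s, max_s, factor):
    """Total wait (seconds) across `threshold` failed polls.

    Assumes one wait per failed poll, with the interval multiplied by
    `factor` after each failure and capped at `max_s`.
    """
    interval = base_s
    total = 0.0
    for _ in range(threshold):
        total += interval
        interval = min(max_s, interval * factor)
    return total


# No backoff: 30 polls at the 500 ms base delay.
no_backoff = time_to_threshold(30, 0.5, 5.0, 1.0)    # 15.0 s

# Assumed doubling backoff: 0.5, 1, 2, 4, then 5 s for the remaining polls.
with_backoff = time_to_threshold(30, 0.5, 5.0, 2.0)  # 137.5 s

# Ceiling used in the comment: every poll at the 5 s maximum.
upper_bound = 30 * 5.0                               # 150.0 s = 2.5 min
```

Under the assumed doubling factor the ramp-up only shortens the total slightly, so 2.5 minutes is a reasonable estimate of how long the stale readings persist.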

bwalkerMIR · 2025-01-21T00:35:02Z

Yeah, I think 2.5 min is too long to show disconnection. In reality, a given monitor gives correct messages very frequently. Even if you poll fast at 200 ms, it is at least 50/50 good vs bad messages, so waiting 2.5 min is far too long. How are you counting the 30 and how are they cleared? Is it 30 consecutive? Given that errors occur all the time, you probably shouldn't just count up errors because you will get there very quickly. You would want to say, for a given monitor, that you have had x errors in a row without a valid read, or x amount of time without a valid read. It shouldn't take that many failed attempts or that much time, probably 10 fails in a row or maybe 10 sec of no response?

mslaffin · 2025-01-21T01:35:43Z

Definitely some valid points. There's baggage left in this from earlier debug efforts.

The driver uses a threshold of 30 consecutive errors to decide if a device is disconnected.

The subsystem class has a separate backoff that increases the polling interval once it starts seeing these errors. The combination of slower polling and a high threshold makes the stale readings persist for too long.

I'll try to simplify this to a single consecutive error threshold to flag on true disconnect.
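One way a single consecutive-error threshold could look, as a rough sketch rather than the change that was actually made. ConsecutiveErrorTracker and its method names are hypothetical, and the threshold of 10 is the value suggested earlier in this thread. The key point is that a valid read resets the count for that unit, so intermittent errors never accumulate toward a disconnect.

```python
class ConsecutiveErrorTracker:
    """Hypothetical per-unit counter of errors in a row."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self._counts = {}  # unit number -> consecutive error count

    def record(self, unit, ok):
        """Record one poll result; return True if the unit is now disconnected."""
        if ok:
            self._counts[unit] = 0  # any valid read clears the streak
        else:
            self._counts[unit] = self._counts.get(unit, 0) + 1
        return self.is_disconnected(unit)

    def is_disconnected(self, unit):
        return self._counts.get(unit, 0) >= self.threshold


tracker = ConsecutiveErrorTracker(threshold=10)

# Alternating good and bad reads never reach the threshold.
flaky = [tracker.record(1, ok=(i % 2 == 0)) for i in range(100)]

# Ten failures in a row flag the unit as disconnected on the tenth.
unplugged = [tracker.record(2, ok=False) for _ in range(10)]
```

With a fixed 500 ms polling interval and no backoff, a threshold of 10 would flag a true disconnect after roughly 5 seconds of polling, plus any per-read timeouts.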

add more debug logging calls for visibility during testing

7105996

mslaffin requested review from bwalkerMIR and mark11778 December 15, 2024 20:56

mslaffin self-assigned this Dec 15, 2024

mslaffin added 4 commits December 16, 2024 17:01

sensor error too aggressive. Only indicate unit error after three con…

5023052

…secutive failed readings

adjust minimum temp scale limit. Tone back error classification

5d26943

quiet down logging

294e55d

increase missed package tolerance and clean up logs

a02629c

mslaffin and others added 5 commits December 20, 2024 15:29

updated initialization flowchart

1d3b247

updated update_temperature callback flowchart

b915cc8

working on read me for the process_monitor.py

35810c9

small changes, same as above commit ^

4d1a3b0

fixed typos in process_monitor.py readme and added imports to DP16 dr…

80a6155

…iver readme

mslaffin and others added 15 commits December 23, 2024 10:50

release modbus_lock between units and maintain BASE_DELAY

f6383e6

add poll_all_units flowchart to README

3af7d66

add DP16ProcessMonitor driver init flowchart to README

1c688e2

more small changes to the Readme documentation

27c49b5

reduce connection state check frequency

78fcb08

each unit operation now fully atomic, buffer clearing before each uni…

7e7261a

…t read. Simplify error paths

add disconnection check

0a83fac

Merge branch 'bugfix/cleanup-PMON' of https://github.com/bwalkerMIR/E…

ef2db7c

…BEAM_dashboard into bugfix/cleanup-PMON

increase inter-unit poll delay

84968a3

remove blue coloring

b1f9662

re-enable error counting

5ea2184

consolidate exception handling, reduce Modbus timeout

75a2be2

enforce rate limiting for critical polling error

89b10c0

spelling

c41ef73

remove hard fail for bad STATUS_RUNNING

abc378b

mslaffin added 6 commits January 18, 2025 18:22

fix status registers typo

241aaac

add helper method to distinguish ModbusIOExceptions

57fa339

remove redundant try-block

ade0fad

unify error handling with minor vs. major threshold

48b0e1a

fix constant syntax

40ab84b

remove problematic DISCONNECTED state assignment

64ca7ec

mslaffin merged commit 871efdf into develop Jan 20, 2025

mslaffin deleted the bugfix/cleanup-PMON branch January 20, 2025 22:14
