-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup branch: PMON #37
Conversation
This was tested and confirmed to be working on 12/16. Video recording: Log file: There's still a bug where one or more units briefly fail to respond in time, despite increasing the fault tolerance at various levels ( I do not think this is a pressing problem though, based on how quickly recovers from the lack of response. The problem doesn't seem to be isolated to any particular unit. My hunch is that we're pushing an aggressive polling cycle that involves multiple reads (status, process_value) every time and occasionally the the DP16 units are polled in an unprepared state, but like I said, I'm not too concerned about this as an immediate issue. |
I mentioned this in our meeting on 12/20, but I think there's a problem with how the The temperatures are updating slower than they should. This behavior can be seen in the video I posted in PR#34. The with self.modbus_lock: # Lock held for all units
for unit in self.unit_numbers:
self._poll_single_unit(unit) # Each unit can take up to 0.5s before timing out
time.sleep(0.1) # single sleep at end The I'm going to pursue some changes to acquire and release modbus_lock between each unit, and try to track the actual time spent polling so that we're not making unnecessarily long |
…t read. Simplify error paths
…BEAM_dashboard into bugfix/cleanup-PMON
Ian C. ran several tests on 1/20 which confirmed that these recent changes mask the spurious Modbus disconnection issue that we've been dealing with. The consequence of this is a 2.5 minute delay period after disconnecting the PMON system where the GUI retains the last known good readings (before accurately reflecting the disconnected state). This is secondary to the main utility of the PMON subsystem and can be improved in a follow-up PR. Attached is Ian's log file illustrating clean temperature reads, and handling several disconnection events. |
Nice! Why is there a 2.5 min delay after disconnecting? Is that unique to unplugging the unit vs getting enough errors? That is, under what condition will the bars turn grey again? 2.5 min for showing disconnection is a long time |
The 2.5min delay occurs after unplugging because this driver shows the last good reading until we receive enough errors to exceed MAX_ERROR_THRESHOLD. This threshold is set at 30, so it's requiring 30 failed poll attempts before it shows the grey disconnected state. With the base 500ms delay, this originally would have been 30x500ms=15seconds, but this delay is modified by an exponential backoff method (adjust_update_interval) in Is 2.5 minutes too long? We can modify this backoff to make it time out faster too. I'm a bit concerned this could be a trade-off situation with the modbus runtime errors though |
Yeah I think 2.5 min is too long to show disconnection. In reality for a given monitor it gives very frequent correct messages. Even if you poll fast at 200 ms it is at least 50/50 with good vs bad messages so waiting 2.5 min is far too long. How are you counting the 30 and how are they cleared? Is it consecutive 30? Given that errors occur all the time, you probably shouldn't just count up errors because you will get there very quickly. You would want to say for a given monitor if you have had x number of errors in a row without a valid read, or x amount of time without a valid read. It shouldn't be that many number of failed attempts or time, probably 10 fails in a row or maybe 10 sec of no response? |
Definitely some valid points. There's baggage left in this from earlier debug efforts. The driver uses a threshold of 30 consecutive errors to decide if a device is disconnected. The subsystem class has a separate backoff that increases the polling interval once it starts seeing these errors. The combo of slower polling and high threshold make this persist for too long. I'll try to simplify this to a single consecutive error threshold to flag on true disconnect. |
This PR is for misc cleanup and changes related to the
ProcessMonitorSubsystem