Motors: Position getting reassigned without being asked to - Reproduce this #2180

Closed
KathrynBaker opened this issue Mar 13, 2017 · 23 comments

@KathrynBaker
Member

KathrynBaker commented Mar 13, 2017

During commissioning on ZOOM it has been noticed a couple of times (and is the case at the moment) that the 'position' of an axis has been altered.

The axes on ZOOM have not been moved; they are all at their forward limit and should all be reading 0 (or close to 0). However, MTR0204 and MTR0208 have had their positions jump to -5.42 and -4.22 respectively during a number of restarts of the IOC and config changes.

This is a potentially catastrophic behaviour for a running experiment, so time should be taken to see if this can be reproduced (ZOOM could be used for this at the moment), and then see if a resolution can be found.

This ticket should be used to see if this can be reliably reproduced, after which a second ticket can be written to see if it can be avoided, or if the resolution is obvious once reproduction of the issue is possible, then a ticket to fix it can be written.

@GDH-ISIS

During commissioning, I gave the axes names in the EPICS "world" (.DESC field). For Galil_02, these names appear to have been changed and are no longer referring to the beam line motion control slits. (This may have been related to getting the globals.txt file operational on ZOOM). It would be good to appreciate why this has happened as we are going to lose our axis identifiers for driving the beam line.
I believe Galil_01 and Galil_03 axis names appear to be fine.

@KathrynBaker
Member Author

I have so far spent two and a half hours actively trying to reproduce this bug and have been unable to do so. I will try once more on another date.

@kjwoodsISIS kjwoodsISIS added this to the SPRINT_2017_03_09 milestone Mar 14, 2017
@kjwoodsISIS
Contributor

Similar problem reported on LARMOR:

From: Dalgliesh, Robert (STFC,RAL,ISIS)
Sent: 14 March 2017 22:00
To: Howells, Gareth (STFC,RAL,ISIS); Akeroyd, Freddie (STFC,RAL,ISIS); Woods, Kevin (Tessella,RAL,ISIS)
Cc: Washington, Adam (STFC,RAL,ISIS); Nilsen, Goran (STFC,RAL,ISIS); Stewart, Ross (STFC,RAL,ISIS)
Subject: Larmor bench limits

Hi,
We have been seeing some very odd problems with the Larmor rotating bench over the past couple of days.

The limit of the bench seems to not be functioning properly and keeps changing.
After determining a safe limit on Monday I set this in EPICS. However, once we tried to move back to the limit it turned out that we had to add 36 deg to this number to make it work.
Again today we moved to this limit and found that we had to add a further 2 degrees to the limit.
The bench itself seems to be moving reproducibly but this limit behaviour is both very odd and has wasted about 4 hours of beam time to date.

Gareth seems to think that it might be a similar problem to the ones seen on IMAT.

Rob

@John-Holt-Tessella
Contributor

Had a very quick look through the patches in the latest release. The only two which seem like they might be connected are #1847 and #1990. @FreddieAkeroyd did we update EPICS base as well?

@kjwoodsISIS
Contributor

Reply to Rob:

From: Woods, Kevin (Tessella,RAL,ISIS)
Sent: 15 March 2017 10:49
To: Dalgliesh, Robert (STFC,RAL,ISIS); Howells, Gareth (STFC,RAL,ISIS); Akeroyd, Freddie (STFC,RAL,ISIS)
Cc: Washington, Adam (STFC,RAL,ISIS); Nilsen, Goran (STFC,RAL,ISIS); Stewart, Ross (STFC,RAL,ISIS)
Subject: RE: Larmor bench limits

Hi Rob,

We are investigating the problem right now. We agree – this behaviour is very odd.

We’d like to try and pin-point when the problem happened – to see if we can spot anything in the logs.
You say you set a safe limit on Monday. How soon afterwards did you notice the problem? Was it immediate, or was there a delay?
Similarly, you moved the bench to a limit yesterday, but then had to add two extra degrees. Did you notice this problem immediately, or only after a period of working?

Kevin

@John-Holt-Tessella
Contributor

Extra info: The problem on ZOOM was on Galil 02. It was spotted on Monday (13/3/2017) and was fine on Friday. It affected all descriptions and the positions reported by 04 and 08. Labview was watching it and it also reported a change in the reported position, so it appears to be something telling the motor itself which is affected. The motors have not moved.
This could be the same as the problem reported by Larmor in that if the position being reported is offset then the limits will not have been changed for the new offset.

Speculation: It may be something to do with autosave not being applied properly. There is a report in the log (on ZOOM, ...Var\logs\ioc\GALIL_02-20170313.log) which says:

[2017-03-13 17:03:42] sevr=info *** restoring from 'C:/Instrument/Var/autosave/GALIL_02/GALIL_02_settings.sav' at initHookState 6 (before record/device init) ***

[2017-03-13 17:03:42] dbFindRecord for 'IN:ZOOM:MOT:MTR0201.DIR' failed

[2017-03-13 17:03:42] dbFindRecord for 'IN:ZOOM:MOT:MTR0201.DHLM' failed

But this seems late for a problem and I cannot find a similar thing on LARMOR, although I only had a brief look.

2 other oddities:
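
To make it easier to check other instruments' logs for the same symptom, the failed restores quoted above could be picked out with a small script. This is only a sketch: the function name `find_failed_restores` is made up for illustration, and the regular expression assumes the log lines always have the `[timestamp] dbFindRecord for '<pv>' failed` shape seen in the excerpt.

```python
import re

# Matches lines such as:
# [2017-03-13 17:03:42] dbFindRecord for 'IN:ZOOM:MOT:MTR0201.DIR' failed
# (assumed format, based only on the log excerpt in this ticket)
FAILED_RESTORE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+dbFindRecord for '(?P<pv>[^']+)' failed"
)


def find_failed_restores(log_lines):
    """Return (timestamp, pv) pairs for every failed dbFindRecord line."""
    failures = []
    for line in log_lines:
        match = FAILED_RESTORE.match(line.strip())
        if match:
            failures.append((match.group("timestamp"), match.group("pv")))
    return failures


# Lines copied from the ZOOM log excerpt in this ticket
sample_log = [
    "[2017-03-13 17:03:42] sevr=info *** restoring from "
    "'C:/Instrument/Var/autosave/GALIL_02/GALIL_02_settings.sav' "
    "at initHookState 6 (before record/device init) ***",
    "[2017-03-13 17:03:42] dbFindRecord for 'IN:ZOOM:MOT:MTR0201.DIR' failed",
    "[2017-03-13 17:03:42] dbFindRecord for 'IN:ZOOM:MOT:MTR0201.DHLM' failed",
]

for timestamp, pv in find_failed_restores(sample_log):
    print(timestamp, pv)
```

Running it over each day's IOC log would give a list of times to compare against when the positions were seen to change.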

  • Limit switches are not reporting correctly on ZOOM - they should all be at their limits

  • Autosave settings have a weird EGU value (see ...Var\autosave\GALIL_01\GALIL_01_settings.sav_170309-152013):

    IN:ZOOM:MOT:MTR0101.DESC PGC
    IN:ZOOM:MOT:MTR0101.EGU mmcaput IN:ZOOM
    IN:ZOOM:MOT:MTR0101.RTRY 10
    

@FreddieAkeroyd
Member

@John-Holt-Tessella I would be very surprised if #1847 or #1990 caused a problem. #1847 just changed an access security group setting to stop excessive logging; #1990 defined an alternative move command which, in the worst case, might stop a move happening if it got set to something accidentally, but would not cause a random move by itself.

@FreddieAkeroyd
Member

Can I just check my understanding of "Labview was watching it and it also reported a change in the reported position, so it appears to be something telling the motor itself which is affected. The motors have not moved."? The motors are reporting a change in position but no change has actually taken place?

@KathrynBaker
Member Author

KathrynBaker commented Mar 15, 2017 via email

@AdrianPotter

From analysing the times at which the autosave error John identified happens, I can see at those times they are trying to save an almost empty autosave file:

# autosave R5.3	Automatically generated - DO NOT MODIFY - 170309-143803
<END>

The file has no records which is why the load fails. Since the load fails, the motor starts with its default settings. This will cause the description and position to appear to change:

  1. The description will be set to its default, which will obviously be different.
  2. Although the motor won't have moved, the MRES field will revert to its default, meaning the same raw motor position is scaled to a different reported position.
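
To illustrate the second point: the motor record derives its dial readback by multiplying the raw step count by MRES, so the same raw count reads back differently once MRES changes. The numbers below are purely hypothetical (they are not the ZOOM values) and the function name `dial_readback` is made up for this sketch.

```python
def dial_readback(raw_steps, mres):
    """Readback in engineering units: raw step count scaled by the resolution (MRES)."""
    return raw_steps * mres


raw_steps = 800        # hypothetical raw count; unchanged because the motor has not moved
saved_mres = 0.125     # hypothetical resolution normally restored from autosave
default_mres = 1.0     # hypothetical default used when the restore fails

print(dial_readback(raw_steps, saved_mres))    # readback with the restored settings
print(dial_readback(raw_steps, default_mres))  # readback after a failed restore
```

The two printed values differ even though `raw_steps` is identical, which is the "position jumped but nothing moved" symptom.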

I've been thus far unable to reproduce the error. I've tried restarting the IOC many times, interrupting it at multiple stages of the startup process. I've tried changing the file permissions on the auto save file, and removing the live version entirely. None of my attempts have worked.

Notably I've scanned Larmor and it doesn't appear to have the same issue. I've scanned my own machine and it looks like it happened a couple of times. The only IOCs it will affect are those that create an autosave monitor, e.g.:



# Save motor positions every 5 seconds
create_monitor_set("$(IOCNAME)_positions.req", 5, "P=$(MYPVPREFIX)MOT:,IFDMC01=$(IFDMC01),IFDMC02=$(IFDMC02),IFDMC03=$(IFDMC03),IFDMC04=$(IFDMC04),IFDMC05=$(IFDMC05),IFDMC06=$(IFDMC06),IFDMC07=$(IFDMC07),IFDMC08=$(IFDMC08),IFDMC09=$(IFDMC09),IFDMC10=$(IFDMC10)")

# Save motor settings every 30 seconds
create_monitor_set("$(IOCNAME)_settings.req", 30, "P=$(MYPVPREFIX)MOT:,IFDMC01=$(IFDMC01),IFDMC02=$(IFDMC02),IFDMC03=$(IFDMC03),IFDMC04=$(IFDMC04),IFDMC05=$(IFDMC05),IFDMC06=$(IFDMC06),IFDMC07=$(IFDMC07),IFDMC08=$(IFDMC08),IFDMC09=$(IFDMC09),IFDMC10=$(IFDMC10)")

There are only a couple of IOCs that do that and the Galil is by far the most common.
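
For anyone wanting to repeat the scan on another machine, something along these lines should find autosave files that contain only the header and the `<END>` marker. It is a rough sketch, not the script I actually used: the function names are invented here, and it assumes every line that is not blank, a `#` comment or `<END>` is a saved PV entry.

```python
from pathlib import Path


def count_saved_pvs(sav_text):
    """Count PV entries in autosave file text, ignoring comments, blanks and <END>."""
    count = 0
    for line in sav_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or stripped == "<END>":
            continue
        count += 1
    return count


def find_empty_sav_files(autosave_dir):
    """Return paths of autosave files under autosave_dir that hold no PV entries."""
    empty = []
    # "*.sav*" also picks up dated backups such as ..._settings.sav_170309-152013
    for path in sorted(Path(autosave_dir).rglob("*.sav*")):
        if path.is_file() and count_saved_pvs(path.read_text(errors="replace")) == 0:
            empty.append(path)
    return empty


# The near-empty file quoted above, and a healthy fragment for comparison
empty_example = (
    "# autosave R5.3\tAutomatically generated - DO NOT MODIFY - 170309-143803\n"
    "<END>\n"
)
healthy_example = (
    "# autosave R5.3\tAutomatically generated - DO NOT MODIFY\n"
    "IN:ZOOM:MOT:MTR0101.DESC PGC\n"
    "IN:ZOOM:MOT:MTR0101.RTRY 10\n"
    "<END>\n"
)

print(count_saved_pvs(empty_example))
print(count_saved_pvs(healthy_example))
```

Calling `find_empty_sav_files` on the instrument's autosave directory would then list the suspect files.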

@AdrianPotter

The issue can be reproduced by starting the IOC once without the GALIL_02__GALILADDR02=192.168.1.202 macro. If the problem is not in autosave, then perhaps the macro was not read correctly. This would explain those times it happened on my own machine (I often forget to make sure the macro is set before starting the IOC). It doesn't explain why it has happened on ZOOM.

@KathrynBaker
Member Author

KathrynBaker commented Mar 15, 2017 via email

@AdrianPotter

Thanks for the info. I'll record the findings somewhere in the troubleshooting wiki. Not sure about Larmor's limits. I'll take a look at their logs but it might be a separate ticket.

@AdrianPotter

@kjwoodsISIS
Contributor

Is there anything we can do to minimise the risk of this type of thing happening again? For example, if default values for all the macros were provided in our standard settings, that might reduce the chances of the macros being left unset (or does providing default macro values create problems of its own?). Other suggestions?
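
As one possible shape for such a safeguard (a sketch only, not something IBEX does today), a pre-flight check could refuse to start the IOC when a required macro is missing. The function names are invented, `GALIL_01__GALILADDR01` is a hypothetical macro named by analogy with `GALIL_02__GALILADDR02`, and the `NAME=VALUE` layout with `#` comments is an assumption about globals.txt.

```python
def parse_globals(text):
    """Parse NAME=VALUE lines into a dict, skipping blanks and '#' comments (assumed syntax)."""
    macros = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        name, value = stripped.split("=", 1)
        macros[name.strip()] = value.strip()
    return macros


def missing_macros(macros, required):
    """Return the required macro names that are absent or set to an empty value."""
    return [name for name in required if not macros.get(name)]


# Hypothetical file contents: the Galil 02 address has been left out
globals_text = "GALIL_01__GALILADDR01=192.168.1.201\n"
required = ["GALIL_01__GALILADDR01", "GALIL_02__GALILADDR02"]

print(missing_macros(parse_globals(globals_text), required))
```

A check like this avoids inventing default values; it only reports what is unset, leaving the decision about how to proceed to whoever is starting the IOC.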

@KathrynBaker
Member Author

KathrynBaker commented Mar 15, 2017 via email

@kjwoodsISIS
Contributor

"providing default values in there for all items might not be a wise move", that may be true, but providing no value for some items is also unwise (as we have just experienced).
Since putting globals.txt at the wrong level was also a contributory factor, what can we do to make it easier to put it at the right level (or harder to put it at the wrong level)?

@AdrianPotter

Marking this for review. Just check my troubleshooting explanation makes sense. From a user perspective, the GALILADDR shouldn't be changed and they shouldn't be doing anything to globals.txt after the instrument is set up. This is likely only a problem the Ibex team will meet.

@GDH-ISIS

May I request that this be discussed a little further. I have seen Rob on LARMOR changing globals.txt on numerous occasions in the past. I think we should be cautious with the assumption that it will remain static.

@GDH-ISIS

Regarding this ticket, have we not got two problems intertwined? This ticket relates to restarting a Galil IOC and, on restart, the readback of an axis being incorrect, with no move being made or requested. I believe this has been seen on ZOOM and IMAT. I believe IMAT's issue has been resolved, with luck (https://github.com/ISISComputingGroup/ControlsWork/issues/162).

(There is another ticket in IBEX relating to limit issues, #2184, and another reference in the Controls work, https://github.com/ISISComputingGroup/ControlsWork/issues/150. This, I believe, has been seen on LARMOR and IMAT.)

@Tom-Willemsen
Contributor

Tom-Willemsen commented Mar 27, 2017

Re @GDH-ISIS's comment above: is this ticket actually ready for review yet?

Have all the relevant discussions been had about instrument scientists changing globals.txt? If so, is the result of that discussion documented anywhere?

i.e. are instrument scientists meant to be modifying globals.txt or not? If so, how have we mitigated the risk of this issue recurring? If not, then how can Rob and others(?) maintain the workflows they need?

The instructions on the troubleshooting page of the dev wiki seem fine to me but I just want to confirm that it's a sufficient solution to this ticket.

@Tom-Willemsen
Contributor

Tom-Willemsen commented Mar 28, 2017

Having just discussed this briefly with @KathrynBaker, I will pull the above questions out into a separate ticket (#2204) and mark this one as complete.

@FreddieAkeroyd
Member

I think the issue here is not globals.txt related: it is the Galil IOC corrupting the rest of its otherwise valid autosave settings if it is not given an IP address. It is strange that the file is nearly blank; I would rather have expected a file of incorrect values.
