
OPC-UA input plugin collection interval changes after bad quality #10665

Closed

sjwang90 opened this issue Feb 16, 2022 · 12 comments
Labels
area/opcua · bug (unexpected problem or unintended behavior) · platform/windows

Comments

@sjwang90
Contributor

Relevant telegraf.conf

[agent]
    interval = "1s"
    debug = false
    round_interval = true
    flush_interval = "1s"
    flush_jitter = "0s"
    collection_jitter = "0s"
    metric_batch_size = 1000
    metric_buffer_limit = 100000
    quiet = true

[[inputs.opcua]]
    endpoint = "opc.tcp://127.0.0.1:12345"
    connect_timeout = "60s"
    security_policy = "auto"
    security_mode = "auto"
    # certificate = "C:/Source/telegraf/telegraf-selfsigned.crt"
    # private_key = "C:/Source/telegraf/telegraf-selfsigned.key"
    # auth_method = "Certificate"

    ## Entire configuration includes 42 opcua.groups with 21 nodes each
    [[inputs.opcua.group]]
        name = "anodic_current"
        namespace = "2"
        identifier_type = "s"
        tags = [["device", "1001"]]
        nodes = [
            {name="current", identifier="ABCD_1234_RA.ID.I01", data_type="float", tags=[["anode","01"]], description="Current Anode 01 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I02", data_type="float", tags=[["anode","02"]], description="Current Anode 02 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I03", data_type="float", tags=[["anode","03"]], description="Current Anode 03 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I04", data_type="float", tags=[["anode","04"]], description="Current Anode 04 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I05", data_type="float", tags=[["anode","05"]], description="Current Anode 05 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I06", data_type="float", tags=[["anode","06"]], description="Current Anode 06 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I07", data_type="float", tags=[["anode","07"]], description="Current Anode 07 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I08", data_type="float", tags=[["anode","08"]], description="Current Anode 08 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I09", data_type="float", tags=[["anode","09"]], description="Current Anode 09 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I10", data_type="float", tags=[["anode","10"]], description="Current Anode 10 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I11", data_type="float", tags=[["anode","11"]], description="Current Anode 11 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I12", data_type="float", tags=[["anode","12"]], description="Current Anode 12 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I13", data_type="float", tags=[["anode","13"]], description="Current Anode 13 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I14", data_type="float", tags=[["anode","14"]], description="Current Anode 14 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I15", data_type="float", tags=[["anode","15"]], description="Current Anode 15 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I16", data_type="float", tags=[["anode","16"]], description="Current Anode 16 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I17", data_type="float", tags=[["anode","17"]], description="Current Anode 17 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I18", data_type="float", tags=[["anode","18"]], description="Current Anode 18 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I19", data_type="float", tags=[["anode","19"]], description="Current Anode 19 device 1001"},
            {name="current", identifier="ABCD_1234_RA.ID.I20", data_type="float", tags=[["anode","20"]], description="Current Anode 20 device 1001"},
            {name="total_current", identifier="ABCD_1234_RA.ID.I_Total", data_type="float", description="Current Total device 1001"},
        ]

Logs from Telegraf

2022-01-19T18:27:21Z D! [agent] Starting service inputs
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:22Z E! [inputs.opcua] status not OK for node total_current:  Bad (0x80000000)
2022-01-19T18:27:22Z D! [outputs.influxdb] Buffer fullness: 0 / 100000 metrics
2022-01-19T18:27:23Z D! [inputs.opcua] Previous collection has not completed; scheduled collection skipped
2022-01-19T18:27:23Z W! [inputs.opcua] Collection took longer than expected; not complete after interval of 1s
2022-01-19T18:27:23Z D! [outputs.influxdb] Buffer fullness: 0 / 100000 metrics
2022-01-19T18:27:24Z D! [inputs.opcua] Previous collection has not completed; scheduled collection skipped
2022-01-19T18:27:24Z W! [inputs.opcua] Collection took longer than expected; not complete after interval of 1s
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node current:  Bad (0x80000000)
2022-01-19T18:27:24Z E! [inputs.opcua] status not OK for node total_current:  Bad (0x80000000)
2022-01-19T18:27:24Z D! [outputs.influxdb] Wrote batch of 1000 metrics in 20.0946ms

System info

Telegraf: 1.21
Operating system: Windows Server 2019
Memory available: 6 GB
CPU cores available: unclear, but at least 2
Disk (HDD/SSD/etc.): 50 GB
OPC-UA server: Kepware KEPServerEX 5.21

Docker

No response

Steps to reproduce

  1. Run Telegraf connected to 40+ groups with 20+ nodes each
  2. Get "status not OK for node current: Bad (0x80000000)" errors
  3. Data is only collected at 3s intervals
    ...

Expected behavior

Acquisition should remain at 1s as specified in the agent interval.

Actual behavior

Running with an agent interval of 1s, everything works fine until one or more nodes comes back with a bad quality (0x80000000) value. Once a bad-quality node is observed, the acquisition interval changes to 3s.

Additional info

No response

@sjwang90 added the bug (unexpected problem or unintended behavior) label on Feb 16, 2022
@sjwang90
Contributor Author

sjwang90 commented Feb 16, 2022

To do: determine what the Bad (0x80000000) status means and why it causes this cascade of issues. This python-opcua client bug shows the same error, and there it also takes multiple seconds for the error to appear.

Other Notes:

  1. The warning messages "Collection took longer than expected; not complete after interval of 1s" are what cause the collection to change from every second to every three seconds. The one-second interval does not complete in time because the input is still waiting on the output, possibly due to the bad value, so scheduled collections are skipped until things clear up. This produces the three-second intervals the customer is seeing.

  2. The "status not OK for node current: Bad (0x80000000)" error is from the OPC UA plugin attempting to read from their device and getting 'The value is bad but the reason is unknown'.

@sjwang90
Contributor Author

sjwang90 commented Feb 16, 2022

@magiconair Would you have any insight specifically into the Bad (0x80000000) error and why it could be causing the behavior outlined in the bug report?

To do: determine what the Bad (0x80000000) status means and why it causes this cascade of issues. FreeOpcUa/python-opcua#1004 shows the same error, and there it also takes multiple seconds for the error to appear.

@magichair
Contributor

@sjwang90 this is outside of my area of contribution, sorry. I have no experience with opcua. 🙇

@sjwang90
Contributor Author

Oops sorry tagged the wrong person!

> @sjwang90 this is outside of my area of contribution, sorry. I have no experience with opcua. 🙇

@R290
Contributor

R290 commented Feb 23, 2022

What setting do you use for request_timeout?

@powersj
Contributor

powersj commented Feb 23, 2022

> What setting do you use for request_timeout?

I do not believe the user has it set. Is there a suggested value?

I have also wondered about the interval being 1 second: if there is any delay at all, that could be causing issues.

@R290
Contributor

R290 commented Feb 23, 2022

The default value is 5s, but I'd say it should always be less than the collection interval. No clue whether this is actually part of the cause of the problem. @sjwang90 could you retry while setting request_timeout = "500ms"?
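For reference, a minimal sketch of where that setting goes in the plugin config; the 500ms value is only the suggestion above, not a verified fix:

[[inputs.opcua]]
    endpoint = "opc.tcp://127.0.0.1:12345"
    ## Keep the request timeout below the agent's 1s collection interval so a
    ## slow read fails fast instead of stalling into the next scheduled cycle.
    request_timeout = "500ms"
    connect_timeout = "60s"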

@powersj
Contributor

powersj commented Feb 25, 2022

This is what started showing up in the logs:

2022-02-25T14:21:50Z W! [inputs.opcua] Collection took longer than expected; not complete after interval of 1s
2022-02-25T14:21:50Z D! [inputs.opcua] Previous collection has not completed; scheduled collection skipped
2022-02-25T14:21:50Z D! [outputs.influxdb] Buffer fullness: 0 / 100000 metrics
2022-02-25T14:21:50Z E! [inputs.opcua] Error in plugin: get Data Failed: RegisterNodes Read failed: The operation timed out. StatusBadTimeout (0x800A0000)
2022-02-25T14:21:51Z D! [outputs.influxdb] Buffer fullness: 0 / 100000 metrics
2022-02-25T14:21:52Z D! [inputs.opcua] Previous collection has not completed; scheduled collection skipped
2022-02-25T14:21:52Z W! [inputs.opcua] Collection took longer than expected; not complete after interval of 1s
2022-02-25T14:21:52Z D! [outputs.influxdb] Buffer fullness: 0 / 100000 metrics
2022-02-25T14:21:52Z E! [inputs.opcua] Error in plugin: get Data Failed: RegisterNodes Read failed: The operation timed out. StatusBadTimeout (0x800A0000)
2022-02-25T14:21:53Z D! [outputs.influxdb] Buffer fullness: 0 / 100000 metrics
2022-02-25T14:21:54Z W! [inputs.opcua] Collection took longer than expected; not complete after interval of 1s
2022-02-25T14:21:54Z D! [inputs.opcua] Previous collection has not completed; scheduled collection skipped
2022-02-25T14:21:54Z D! [outputs.influxdb] Buffer fullness: 0 / 100000 metrics
2022-02-25T14:21:54Z E! [inputs.opcua] Error in plugin: get Data Failed: RegisterNodes Read failed: The operation timed out. StatusBadTimeout (0x800A0000)

The user also said that with any timeout value under 4 seconds, Telegraf did not appear to get the data.

Does this mean the device may not be able to be queried every second?

@R290
Contributor

R290 commented Feb 27, 2022

What I understand from the comments on the Python bug mentioned by sjwang90 is that the Kepware server does not return for 3-4 seconds when a bad-quality node is part of the registered read request. Currently we do not support subscriptions or events, which is the solution mentioned there (#8083). For now, a hacky workaround could be to split the node groups across several opcua input plugins (sketched below). Only the plugin whose read request contains the bad node would drop a few values, while the others could continue at the once-per-second rate.
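A minimal sketch of that workaround, assuming the reporter's 42 groups are split between two plugin instances (the split point and device tags are hypothetical); each [[inputs.opcua]] instance gets its own connection and its own collection schedule, so a stalled read in one does not block the other:

## Instance 1: first half of the groups
[[inputs.opcua]]
    endpoint = "opc.tcp://127.0.0.1:12345"
    [[inputs.opcua.group]]
        name = "anodic_current"
        tags = [["device", "1001"]]
        # ... nodes for this group ...

## Instance 2: second half of the groups
[[inputs.opcua]]
    endpoint = "opc.tcp://127.0.0.1:12345"
    [[inputs.opcua.group]]
        name = "anodic_current"
        tags = [["device", "1022"]]
        # ... nodes for this group ...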

@powersj
Contributor

powersj commented Feb 28, 2022

> bad-quality node is part of the registered read request

Is this the result of the sensor itself having a bad value? Is there something the OPC UA client should do to avoid the delay?

> For now, a hacky workaround could be to split the node groups across several opcua input plugins. Only the plugin whose read request contains the bad node would drop a few values, while the others could continue at the once-per-second rate.

I agree; without an event-based plugin, this does seem like the right way forward for now.

@R290
Contributor

R290 commented Mar 1, 2022

> Is this the result of the sensor itself having a bad value? Is there something the OPC UA client should do to avoid the delay?

I do indeed think the sensor itself is assigned a bad quality. If the server takes this long to reply, there is not much we can do as a client. A Wireshark capture or similar could provide definitive evidence that the server is causing the delay.

@magiconair

@sjwang90 not really. This error vaguely rings a bell in connection with the Kepware server, depending on how it is hooked up to the backend data collection, but it has been a looong time since I've played with Kepware. Try running go run ./examples/read/read.go from the https://github.com/gopcua/opcua repo to see if you can read the node at all.
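For reference, a minimal sketch along the lines of that read example, pointed at one node from the reporter's config (gopcua's API has shifted across versions, so treat the exact signatures as assumptions; the endpoint and node ID are taken from the config above):

package main

import (
	"context"
	"log"

	"github.com/gopcua/opcua"
	"github.com/gopcua/opcua/ua"
)

func main() {
	ctx := context.Background()

	// Endpoint from the reporter's config; security disabled for a simple
	// connectivity test.
	c := opcua.NewClient("opc.tcp://127.0.0.1:12345",
		opcua.SecurityMode(ua.MessageSecurityModeNone))
	if err := c.Connect(ctx); err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// ns=2 with a string identifier, matching namespace="2" / identifier_type="s".
	id, err := ua.ParseNodeID("ns=2;s=ABCD_1234_RA.ID.I01")
	if err != nil {
		log.Fatal(err)
	}

	resp, err := c.Read(&ua.ReadRequest{
		NodesToRead:        []*ua.ReadValueID{{NodeID: id}},
		TimestampsToReturn: ua.TimestampsToReturnBoth,
	})
	if err != nil {
		log.Fatalf("Read failed: %v", err)
	}
	// A bad-quality node surfaces here as a non-OK status, e.g. Bad (0x80000000).
	if got := resp.Results[0].Status; got != ua.StatusOK {
		log.Fatalf("status not OK: %v", got)
	}
	log.Printf("value: %v", resp.Results[0].Value.Value())
}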

@powersj closed this as not planned on Feb 9, 2023