Hi, I recently investigated why all of our data sources of a specific type stop working for a while when we add one more data source with incorrect parameters. While reviewing the InstanceManager Get implementation, I found the following issues:
Issue 1. We can create an instance that will never be properly disposed.
Issue 2. The Dispose method of an instance has to be thread-safe (I have not found this mentioned anywhere).
Issue 3. If InstanceFactoryFunc runs for a long time for some reason (for example, it opens a network connection and hits a 30-second timeout), all other data source instances are blocked for that time.
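Let's review the implementation in grafana-plugin-sdk-go/backend/instancemgmt/instance_manager.go (lines 78 to 112 in f39bb5f).
We can see that the Dispose call is not inside the lock section, so it can be called in parallel. It therefore has to be thread-safe internally. (Issue 2)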
We can also have a situation where two threads get the instance from the cache and both detect that the instance needs to be updated.
While the first thread is creating a new instance, adding it to the cache and returning it to the caller, the second thread is waiting for the lock (line 99 of instance_manager.go in f39bb5f).
When the first thread releases the lock, the second thread does the same: it creates an instance, adds it to the cache, and returns it to the caller.
As a result, the instance created by the first thread is no longer in the cache and will never be disposed. (Issue 1)
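To make the interleaving concrete, here is a minimal sketch in Go. It is not the SDK code: `manager`, `instance`, `lookup` and `replace` are hypothetical stand-ins that only mimic the shape of the flow described above, and the two callers are replayed sequentially so that the outcome is deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// instance is a hypothetical stand-in for a cached plugin instance.
type instance struct {
	id           int
	disposeCalls int
}

// manager mimics the flow described above: one RWMutex guarding one map.
type manager struct {
	mu      sync.RWMutex
	cache   map[string]*instance
	created []*instance // every instance ever created, for bookkeeping
}

func (m *manager) newInstance() *instance {
	inst := &instance{id: len(m.created) + 1}
	m.created = append(m.created, inst)
	return inst
}

// lookup is step 1: read the cached instance under the read lock.
func (m *manager) lookup(key string) *instance {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.cache[key]
}

// replace is step 2: dispose the stale instance outside the lock,
// then create a new instance and store it under the write lock.
func (m *manager) replace(key string, stale *instance) *instance {
	stale.disposeCalls++ // stand-in for Dispose, not protected by the lock
	m.mu.Lock()
	defer m.mu.Unlock()
	inst := m.newInstance()
	m.cache[key] = inst
	return inst
}

// simulate replays the interleaving sequentially. It returns the IDs of
// instances that are neither cached nor disposed, plus the number of
// Dispose calls received by the original instance.
func simulate() (leaked []int, staleDisposeCalls int) {
	m := &manager{cache: map[string]*instance{}}
	stale := m.newInstance() // id 1, currently cached
	m.cache["ds-1"] = stale

	// Both callers read the same stale instance before either updates it.
	seenByA := m.lookup("ds-1")
	seenByB := m.lookup("ds-1")

	m.replace("ds-1", seenByA) // caller A stores id 2
	m.replace("ds-1", seenByB) // caller B overwrites it with id 3

	cached := m.lookup("ds-1")
	for _, inst := range m.created {
		if inst != cached && inst.disposeCalls == 0 {
			leaked = append(leaked, inst.id)
		}
	}
	return leaked, stale.disposeCalls
}

func main() {
	leaked, calls := simulate()
	fmt.Println("leaked instance IDs:", leaked)
	fmt.Println("Dispose calls on the stale instance:", calls)
}
```

In this replay, the instance with id 2 ends up neither cached nor disposed (Issue 1), and the original instance receives two Dispose calls (Issue 2).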
Also, since a single RWMutex synchronizes all instances in the cache, any slowness in the instance creation call (line 102 of instance_manager.go in f39bb5f)
leads to global locking of all processing, because other threads cannot get their instances to process requests. (Issue 3)
In the real world this can lead to the following scenario: someone on the team breaks the settings of one data source, which causes it to get stuck during creation. As a result, all other data sources of this type stop working.
Maybe it makes sense to use a named RWMutex implementation such as https://github.com/enfipy/locker, with the data source ID as the key.
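As a rough illustration of the idea, a named (per-key) RWMutex could look like the sketch below. This is not the API of the linked library: `keyedRWMutex` and all of its methods are hypothetical names made up for this example, and the `TryLock` helper assumes Go 1.18 or newer.

```go
package main

import (
	"fmt"
	"sync"
)

// keyedRWMutex is a hypothetical, minimal "named" RWMutex: one lock per key.
// Locks are never removed from the map, which is acceptable for a sketch but
// would need cleanup in long-running code with many keys.
type keyedRWMutex struct {
	mu    sync.Mutex
	locks map[string]*sync.RWMutex
}

// forKey returns the lock for key, creating it on first use.
func (k *keyedRWMutex) forKey(key string) *sync.RWMutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.locks == nil {
		k.locks = make(map[string]*sync.RWMutex)
	}
	l, ok := k.locks[key]
	if !ok {
		l = &sync.RWMutex{}
		k.locks[key] = l
	}
	return l
}

func (k *keyedRWMutex) Lock(key string)    { k.forKey(key).Lock() }
func (k *keyedRWMutex) Unlock(key string)  { k.forKey(key).Unlock() }
func (k *keyedRWMutex) RLock(key string)   { k.forKey(key).RLock() }
func (k *keyedRWMutex) RUnlock(key string) { k.forKey(key).RUnlock() }

// TryLock reports whether the write lock for key could be taken right now.
// It relies on sync.RWMutex.TryLock, available since Go 1.18.
func (k *keyedRWMutex) TryLock(key string) bool { return k.forKey(key).TryLock() }

// demo holds the lock of a "stuck" data source and checks which keys
// can still be locked while it is held.
func demo() (healthyAvailable, brokenAvailable bool) {
	var k keyedRWMutex
	k.Lock("ds-broken") // pretend instance creation hangs here
	defer k.Unlock("ds-broken")

	healthyAvailable = k.TryLock("ds-healthy")
	if healthyAvailable {
		k.Unlock("ds-healthy")
	}
	// The broken key is still held above, so this attempt fails.
	brokenAvailable = k.TryLock("ds-broken")
	return healthyAvailable, brokenAvailable
}

func main() {
	healthy, broken := demo()
	fmt.Println("ds-healthy lockable while ds-broken is stuck:", healthy)
	fmt.Println("ds-broken lockable while it is stuck:", broken)
}
```

With one lock per data source ID, a data source that hangs during creation only blocks callers asking for that same ID.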
@santriseus thanks. Interesting findings. You seem to have quite a good idea of what's missing here. Would you be interested in providing a suggested fix or pull request?
This changes how InstanceManager creates, caches and disposes of instances.
Added tests that clarify the issues reported in the bug. Without the changes
here these tests should fail.
I moved the Dispose call and the code that checks whether the value exists into
the lock section. To optimize this (99.999...% of calls just get the value from
the cache and return it to the caller after the config check), I implemented
double-checked locking.
I replaced the simple sync.RWMutex with a named RWMutex implementation,
which allows instances with different IDs to be processed in parallel without
blocking each other.
Since I replaced the RWMutex with a named RWMutex, we now have parallel
read/write calls on the map of cached instances (for example, reading
instance id#1 while writing instance id#2), so I had to replace the map with
sync.Map.
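For context, a plain Go map is not safe for concurrent writes, even when every writer uses a different key, whereas sync.Map's Store, Load and Range may be called from multiple goroutines. A small sketch of that property (the `storeConcurrently` helper and the `ds-N` keys are made up for this example):

```go
package main

import (
	"fmt"
	"sync"
)

// storeConcurrently writes n distinct keys from n goroutines into a sync.Map
// and returns how many keys ended up stored.
func storeConcurrently(n int) int {
	var cache sync.Map
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Each goroutine touches only its own key, as with per-ID locks.
			cache.Store(fmt.Sprintf("ds-%d", id), id)
		}(i)
	}
	wg.Wait()

	count := 0
	cache.Range(func(_, _ interface{}) bool {
		count++
		return true // keep iterating
	})
	return count
}

func main() {
	fmt.Println("stored keys:", storeConcurrently(100))
}
```

The same loop over a plain `map[string]int` without a shared lock is a data race, and the Go runtime may abort with a concurrent map writes error.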
Fixes #248
Co-authored-by: Andrei Shamanau <andrei.shamanau@softeq.com>
Co-authored-by: Marcus Efraimsson <marcus.efraimsson@gmail.com>