
Possible race condition issues in InstanceManager Get implementation. #248

Closed
santriseus opened this issue Jan 14, 2021 · 1 comment · Fixed by #251
Labels
bug Something isn't working

Comments

@santriseus
Contributor

Hi, I recently investigated why all of our data sources of a specific type stop working for a while whenever we add one more data source with incorrect parameters, and I found the following issues while reviewing the InstanceManager Get implementation:

Issue 1. We can create an instance that will never be properly disposed.
Issue 2. The Dispose method of an instance has to be thread-safe (I have not found any mention of that).
Issue 3. If InstanceFactoryFunc runs for a long time for some reason (for example, it opens a network connection and hits a 30-second timeout), all other data source instances will be blocked for that time.

Let's review the implementation:

func (im *instanceManager) Get(pluginContext backend.PluginContext) (Instance, error) {
    cacheKey, err := im.provider.GetKey(pluginContext)
    if err != nil {
        return nil, err
    }
    im.rwMutex.RLock()
    ci, ok := im.cache[cacheKey]
    im.rwMutex.RUnlock()
    if ok {
        needsUpdate := im.provider.NeedsUpdate(pluginContext, ci)
        if !needsUpdate {
            return ci.instance, nil
        }
        if disposer, valid := ci.instance.(InstanceDisposer); valid {
            disposer.Dispose()
        }
    }
    im.rwMutex.Lock()
    defer im.rwMutex.Unlock()
    instance, err := im.provider.NewInstance(pluginContext)
    if err != nil {
        return nil, err
    }
    im.cache[cacheKey] = CachedInstance{
        PluginContext: pluginContext,
        instance:      instance,
    }
    return instance, nil
}

We can see that the Dispose call is not inside the lock section, so it can be invoked in parallel. It therefore has to be thread-safe internally. (Issue 2)

We can also have a situation where two threads get the instance from the cache and both detect that it needs to be updated.
While the first thread is creating a new instance, adding it to the cache and returning it to the caller, the second thread will be waiting in the im.rwMutex.Lock() call.
When the first thread releases the lock, the second thread will do the same: create an instance, add it to the cache, and return it to the caller.
As a result, the instance created by the first thread is no longer in the cache and will never be disposed. (Issue 1)
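For example, something along these lines could avoid the orphaned instance (a rough sketch of my suggestion, not a final fix): after taking the write lock, re-check the cache so the second thread reuses the instance just created by the first one.

im.rwMutex.Lock()
defer im.rwMutex.Unlock()
// Re-check under the write lock: another goroutine may have already
// replaced the instance while we were waiting for the lock.
if ci, ok := im.cache[cacheKey]; ok {
    if !im.provider.NeedsUpdate(pluginContext, ci) {
        return ci.instance, nil
    }
    // Dispose of the stale instance while still holding the lock,
    // so Dispose is never called concurrently for the same key.
    if disposer, valid := ci.instance.(InstanceDisposer); valid {
        disposer.Dispose()
    }
}
instance, err := im.provider.NewInstance(pluginContext)
if err != nil {
    return nil, err
}
im.cache[cacheKey] = CachedInstance{
    PluginContext: pluginContext,
    instance:      instance,
}
return instance, nil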

Also, since a single RWMutex is used to synchronize all the instances in the cache, any slowness in the

instance, err := im.provider.NewInstance(pluginContext)

call will lead to global locking of all processing, since other threads cannot get the instances they need to serve their requests. (Issue 3)
In the real world this can lead to the following scenario: someone on the team breaks the settings of one data source, which causes it to get stuck during creation. As a result, all other data sources of this type stop working.
Maybe it makes sense to use a named RWMutex implementation like https://github.com/enfipy/locker and use the data source id as a key.
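For illustration, a minimal keyed RWMutex could look roughly like this (hypothetical names, not the API of the linked library): each cache key gets its own lock, so slow instance creation for one data source does not block Get calls for the others.

package example // hypothetical package, for illustration only

import "sync"

// keyedRWMutex hands out one RWMutex per key.
type keyedRWMutex struct {
    mu    sync.Mutex
    locks map[string]*sync.RWMutex
}

func (k *keyedRWMutex) get(key string) *sync.RWMutex {
    k.mu.Lock()
    defer k.mu.Unlock()
    if k.locks == nil {
        k.locks = make(map[string]*sync.RWMutex)
    }
    l, ok := k.locks[key]
    if !ok {
        l = &sync.RWMutex{}
        k.locks[key] = l
    }
    return l
}

In Get, the manager would then lock only the mutex returned by get(cacheKey) instead of a single global rwMutex.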

@santriseus santriseus added the bug Something isn't working label Jan 14, 2021
@marefr
Contributor

marefr commented Jan 14, 2021

@santriseus thanks. Interesting findings. You seem to have quite a good idea of what's missing here. Would you be interested in providing a suggested fix/pull request?

marefr added a commit that referenced this issue Feb 2, 2021
This changes how InstanceManager creates, caches and disposes of instances.

Added tests which clarify the issues reported in the bug. Without the changes
here, these tests fail.
I put the Dispose call and the check for an existing cached value inside the lock section.
To optimize it (99.999...% of calls just get the value from the cache and
return it to the caller after the config check) I implemented double-checked locking.
I replaced the simple sync.RWMutex with a named RWMutex implementation,
which allows instances with different ids to be processed in parallel without locking each other.
Since the named RWMutex allows parallel read/write calls to the map of cached instances
(for example, reading instance id#1 while writing instance id#2), I also had to replace
the map with a sync.Map.

Fixes #248

Co-authored-by: Andrei Shamanau <andrei.shamanau@softeq.com>
Co-authored-by: Marcus Efraimsson <marcus.efraimsson@gmail.com>
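For illustration, a rough sketch of how those pieces could fit together, assuming a hypothetical per-key locker with Lock/Unlock/RLock/RUnlock(key) methods and im.cache as a sync.Map; the code merged in #251 is the authoritative version.

func (im *instanceManager) Get(pluginContext backend.PluginContext) (Instance, error) {
    cacheKey, err := im.provider.GetKey(pluginContext)
    if err != nil {
        return nil, err
    }
    // Fast path: per-key read lock, so other data sources are unaffected.
    im.locker.RLock(cacheKey)
    if item, ok := im.cache.Load(cacheKey); ok {
        ci := item.(CachedInstance)
        if !im.provider.NeedsUpdate(pluginContext, ci) {
            im.locker.RUnlock(cacheKey)
            return ci.instance, nil
        }
    }
    im.locker.RUnlock(cacheKey)

    // Slow path: per-key write lock plus a re-check (double-checked locking).
    im.locker.Lock(cacheKey)
    defer im.locker.Unlock(cacheKey)
    if item, ok := im.cache.Load(cacheKey); ok {
        ci := item.(CachedInstance)
        if !im.provider.NeedsUpdate(pluginContext, ci) {
            return ci.instance, nil
        }
        // Dispose of the stale instance while holding the write lock,
        // so Dispose is never called concurrently for the same key.
        if disposer, valid := ci.instance.(InstanceDisposer); valid {
            disposer.Dispose()
        }
    }

    instance, err := im.provider.NewInstance(pluginContext)
    if err != nil {
        return nil, err
    }
    im.cache.Store(cacheKey, CachedInstance{PluginContext: pluginContext, instance: instance})
    return instance, nil
}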