
Possible race condition issues in InstanceManager Get implementation. #248

Closed
santriseus opened this issue Jan 14, 2021 · 1 comment · Fixed by #251
Labels
bug Something isn't working

Comments

@santriseus
Contributor

Hi, I recently investigated why all of our data sources of a specific type stop working for a while whenever we add one more data source with incorrect parameters, and I found the following issues while reviewing the InstanceManager Get implementation:

Issue 1. We can create an instance that will never be properly disposed.
Issue 2. The Dispose method of an instance has to be thread-safe (I have not found any mention of that).
Issue 3. If InstanceFactoryFunc runs for a long time for some reason (for example, it opens a network connection and hits a 30-second timeout), all other data source instances will be blocked for that time.

Let's review the implementation:

func (im *instanceManager) Get(pluginContext backend.PluginContext) (Instance, error) {
    cacheKey, err := im.provider.GetKey(pluginContext)
    if err != nil {
        return nil, err
    }
    im.rwMutex.RLock()
    ci, ok := im.cache[cacheKey]
    im.rwMutex.RUnlock()
    if ok {
        needsUpdate := im.provider.NeedsUpdate(pluginContext, ci)
        if !needsUpdate {
            return ci.instance, nil
        }
        if disposer, valid := ci.instance.(InstanceDisposer); valid {
            disposer.Dispose()
        }
    }
    im.rwMutex.Lock()
    defer im.rwMutex.Unlock()
    instance, err := im.provider.NewInstance(pluginContext)
    if err != nil {
        return nil, err
    }
    im.cache[cacheKey] = CachedInstance{
        PluginContext: pluginContext,
        instance:      instance,
    }
    return instance, nil
}

We can see that the Dispose call is not inside the lock section, so it can be invoked in parallel. It therefore has to be thread-safe internally. (Issue 2)

We can also have a situation where two threads get the instance from the cache and both detect that it needs to be updated.
While the first thread is creating a new instance, adding it to the cache and returning it to the caller, the second thread will be waiting in the im.rwMutex.Lock() call.
When the first thread releases the lock, the second thread will do the same: create an instance, add it to the cache, and return it to the caller.
As a result, the instance created by the first thread is no longer in the cache and will never be disposed. (Issue 1)
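For example, something along these lines could avoid the orphaned instance (a rough sketch of my suggestion, not a final fix): after taking the write lock, re-check the cache so the second thread reuses the instance just created by the first one.

im.rwMutex.Lock()
defer im.rwMutex.Unlock()
// Re-check under the write lock: another goroutine may have already
// replaced the instance while we were waiting for the lock.
if ci, ok := im.cache[cacheKey]; ok {
    if !im.provider.NeedsUpdate(pluginContext, ci) {
        return ci.instance, nil
    }
    // Dispose of the stale instance while still holding the lock,
    // so Dispose is never called concurrently for the same key.
    if disposer, valid := ci.instance.(InstanceDisposer); valid {
        disposer.Dispose()
    }
}
instance, err := im.provider.NewInstance(pluginContext)
if err != nil {
    return nil, err
}
im.cache[cacheKey] = CachedInstance{
    PluginContext: pluginContext,
    instance:      instance,
}
return instance, nil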

Also, since a single RWMutex is used to synchronize all the instances in the cache, any slowness in the

instance, err := im.provider.NewInstance(pluginContext)

call will lead to global locking of all processing, since other threads cannot get the instances they need to serve their requests. (Issue 3)
In the real world this can lead to the following scenario: someone on the team breaks the settings of one data source, which causes it to get stuck during creation. As a result, all other data sources of this type stop working.
Maybe it makes sense to use a named RWMutex implementation like https://github.com/enfipy/locker and use the data source id as a key.
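For illustration, a minimal keyed RWMutex could look roughly like this (hypothetical names, not the API of the linked library): each cache key gets its own lock, so slow instance creation for one data source does not block Get calls for the others.

package example // hypothetical package, for illustration only

import "sync"

// keyedRWMutex hands out one RWMutex per key.
type keyedRWMutex struct {
    mu    sync.Mutex
    locks map[string]*sync.RWMutex
}

func (k *keyedRWMutex) get(key string) *sync.RWMutex {
    k.mu.Lock()
    defer k.mu.Unlock()
    if k.locks == nil {
        k.locks = make(map[string]*sync.RWMutex)
    }
    l, ok := k.locks[key]
    if !ok {
        l = &sync.RWMutex{}
        k.locks[key] = l
    }
    return l
}

In Get, the manager would then lock only the mutex returned by get(cacheKey) instead of a single global rwMutex.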

@santriseus santriseus added the bug Something isn't working label Jan 14, 2021
@marefr
Contributor

marefr commented Jan 14, 2021

@santriseus thanks. Interesting findings. You seem to have quite a good idea of what's missing here. Would you be interested in providing a suggested fix/pull request?

marefr added a commit that referenced this issue Feb 2, 2021
This changes how InstanceManager creates, caches and disposes of instances.

Added tests which clarify the issues reported in the bug. Without the changes
here, these tests fail.
I put the Dispose call and the check for an existing cached value inside the lock section.
To optimize it (99.999...% of calls just get the value from the cache and
return it to the caller after the config check) I implemented double-checked locking.
I replaced the simple sync.RWMutex with a named RWMutex implementation,
which allows instances with different ids to be processed in parallel without locking each other.
Since the named RWMutex allows parallel read/write calls to the map of cached instances
(for example, reading instance id#1 while writing instance id#2), I also had to replace
the map with a sync.Map.

Fixes #248

Co-authored-by: Andrei Shamanau <andrei.shamanau@softeq.com>
Co-authored-by: Marcus Efraimsson <marcus.efraimsson@gmail.com>
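For illustration, a rough sketch of how those pieces could fit together, assuming a hypothetical per-key locker with Lock/Unlock/RLock/RUnlock(key) methods and im.cache as a sync.Map; the code merged in #251 is the authoritative version.

func (im *instanceManager) Get(pluginContext backend.PluginContext) (Instance, error) {
    cacheKey, err := im.provider.GetKey(pluginContext)
    if err != nil {
        return nil, err
    }
    // Fast path: per-key read lock, so other data sources are unaffected.
    im.locker.RLock(cacheKey)
    if item, ok := im.cache.Load(cacheKey); ok {
        ci := item.(CachedInstance)
        if !im.provider.NeedsUpdate(pluginContext, ci) {
            im.locker.RUnlock(cacheKey)
            return ci.instance, nil
        }
    }
    im.locker.RUnlock(cacheKey)

    // Slow path: per-key write lock plus a re-check (double-checked locking).
    im.locker.Lock(cacheKey)
    defer im.locker.Unlock(cacheKey)
    if item, ok := im.cache.Load(cacheKey); ok {
        ci := item.(CachedInstance)
        if !im.provider.NeedsUpdate(pluginContext, ci) {
            return ci.instance, nil
        }
        // Dispose of the stale instance while holding the write lock,
        // so Dispose is never called concurrently for the same key.
        if disposer, valid := ci.instance.(InstanceDisposer); valid {
            disposer.Dispose()
        }
    }

    instance, err := im.provider.NewInstance(pluginContext)
    if err != nil {
        return nil, err
    }
    im.cache.Store(cacheKey, CachedInstance{PluginContext: pluginContext, instance: instance})
    return instance, nil
}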