Matt Williams edited this page Jul 22, 2016 · 3 revisions

Introduction

A proposal for splitting the duties of GangaObject into two classes:

  1. one which is a standard in-memory object with a _data-backed store (GangaObject)
  2. one which is stored in a registry and stores all its data in the registry (RegistryObject)

RegistryObject

In all of Ganga, only a very small number of object types are ever stored in registries (ignoring the box registry for now): Job, Task, and ShareRef. Of those, only Job is ever lazily loaded. Nonetheless, every GangaObject needs to worry about both lazy-loading (_index_cache etc.) and registry membership. We propose simplifying this by creating a subclass of GangaObject which would know how to deal with these things, so that GangaObject (particularly its descriptor getters and setters) could be simplified to standard _dict access.

GangaObjects would never be lazy-loaded and would always have all their data stored in _data, as is the case for most objects at the moment. A RegistryObject would be a very thin wrapper object which would only store an id and a _registry attribute. Any call to, for example, j.status would be redirected to something like self._registry.get_attribute(self.id, 'status'). It would then be the responsibility of the registry to decide how to implement get_attribute(). For example, JobRegistry would try to get the information from the cache in preference to loading the object fully, but PrepRegistry would always get the session lock and load the XML from disk.
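The redirection described above can be sketched as follows. This is a minimal illustration, not Ganga code: ToyRegistry is a hypothetical stand-in that keeps its data in a plain dictionary, and using __getattr__ is just one assumed way of forwarding attribute access to get_attribute().

```python
class ToyRegistry:
    """Hypothetical stand-in for a registry; holds attribute data keyed by object id."""

    def __init__(self):
        self._rows = {}

    def add(self, obj_id, data):
        self._rows[obj_id] = dict(data)

    def get_attribute(self, obj_id, name):
        # A real registry would decide here between its cache and a full load.
        return self._rows[obj_id][name]


class RegistryObject:
    """Thin wrapper: stores only an id and a reference to its registry."""

    def __init__(self, obj_id, registry):
        self.id = obj_id
        self._registry = registry

    def __getattr__(self, name):
        # Only called when normal lookup fails, so id and _registry are unaffected.
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self._registry.get_attribute(self.id, name)
        except KeyError:
            raise AttributeError(name) from None


registry = ToyRegistry()
registry.add(0, {'status': 'running', 'name': 'test'})
j = RegistryObject(0, registry)
print(j.status)  # running
```

The wrapper itself holds no job data, so two wrappers with the same id and registry always see the same values.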

Registry changes

The registry will now be responsible for storing all the data about its items directly. So instead of having Registry._objects be a list of GangaObjects, it will be a table of data with three columns: object id, object cache and full object data. On initial creation of the registry, it will request from the repository the cache for each of the objects it knows about and enter them into the table. When data is requested that cannot be retrieved from the cache column, the registry will request the full object from the repository. All the repository needs to return is the data dictionary for the object (the equivalent of the current _data attribute), which the registry can then place in the correct cell in the table.
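As a rough sketch of the three-column table, assuming a hypothetical in-memory ToyRepository with known_ids(), load_cache() and load_data() methods (none of these names come from Ganga), the registry could fill the cache column at start-up and the data column on demand. Here the toy registry chooses to treat cache keys as attribute names; a real registry need not.

```python
class ToyRepository:
    """Hypothetical repository backed by in-memory dicts instead of disk."""

    def __init__(self, caches, full_data):
        self._caches = caches
        self._full = full_data
        self.full_loads = 0  # counts expensive loads, for illustration only

    def known_ids(self):
        return list(self._caches)

    def load_cache(self, obj_id):
        return dict(self._caches[obj_id])

    def load_data(self, obj_id):
        self.full_loads += 1
        return dict(self._full[obj_id])


class Registry:
    def __init__(self, repository):
        self._repository = repository
        # One row per object id; the full-data cell starts empty (None).
        self._table = {
            obj_id: {'cache': repository.load_cache(obj_id), 'data': None}
            for obj_id in repository.known_ids()
        }

    def get_attribute(self, obj_id, name):
        row = self._table[obj_id]
        if row['data'] is None:
            if name in row['cache']:
                return row['cache'][name]
            # Cache cannot answer: fetch the full data dictionary once.
            row['data'] = self._repository.load_data(obj_id)
        return row['data'][name]


repo = ToyRepository(
    caches={1: {'status': 'new'}},
    full_data={1: {'status': 'new', 'inputdata': ['a.root']}},
)
reg = Registry(repo)
status = reg.get_attribute(1, 'status')        # answered from the cache column
loads_after_status = repo.full_loads
inputdata = reg.get_attribute(1, 'inputdata')  # triggers a full load
loads_after_inputdata = repo.full_loads
```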

We then overload RegistryObject's Descriptor (via some simple metaclass magic) so that, instead of accessing _data, __get__ and __set__ call through to the registry (Registry.get_attribute() for reads). get_attribute() will query the data table and return the appropriate information. The registry can at this point decide whether to return info from the cache or fully load the object from disk and give that data instead.
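One way the descriptor override might look is sketched below in Python 3 syntax. All class names are illustrative, the _schema tuple is a simplified stand-in for Ganga's schema, and set_attribute() is an assumed counterpart to get_attribute() that the proposal does not name.

```python
class DataDescriptor:
    """Stand-in for the plain descriptor: reads and writes obj._data."""

    def __init__(self, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj._data[self.name]

    def __set__(self, obj, value):
        obj._data[self.name] = value


class RegistryDescriptor(DataDescriptor):
    """Same interface, but every access is forwarded to the owning registry."""

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj._registry.get_attribute(obj.id, self.name)

    def __set__(self, obj, value):
        # set_attribute() is hypothetical; the proposal only names get_attribute().
        obj._registry.set_attribute(obj.id, self.name, value)


class RegistryObjectMeta(type):
    """Installs a RegistryDescriptor for each name listed in _schema."""

    def __new__(mcs, clsname, bases, namespace):
        for attr in namespace.get('_schema', ()):
            namespace[attr] = RegistryDescriptor(attr)
        return super().__new__(mcs, clsname, bases, namespace)


class DictRegistry:
    """Hypothetical registry storing attributes in nested dicts."""

    def __init__(self):
        self._rows = {}

    def get_attribute(self, obj_id, name):
        return self._rows[obj_id][name]

    def set_attribute(self, obj_id, name, value):
        self._rows.setdefault(obj_id, {})[name] = value


class ToyJob(metaclass=RegistryObjectMeta):
    _schema = ('status', 'name')

    def __init__(self, obj_id, registry):
        self.id = obj_id
        self._registry = registry


reg = DictRegistry()
j = ToyJob(7, reg)
j.status = 'submitted'  # stored in the registry, not on the object
print(j.status)         # submitted
```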

Repository changes

The repository's interface can be conceptually simplified down to three methods:

  1. return the cache for an object
  2. return the full data for an object
  3. given an object and its cache data, write it to disk

with potentially an additional method to return all caches for all objects for efficiency on first load.
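Expressed as an abstract interface, the three methods plus the optional bulk method might look like the sketch below. The method names and signatures are assumptions made for illustration, and InMemoryRepository keeps everything in dictionaries rather than writing to disk.

```python
from abc import ABC, abstractmethod


class Repository(ABC):
    """Hypothetical minimal repository interface."""

    @abstractmethod
    def load_cache(self, obj_id):
        """Return the cache for one object."""

    @abstractmethod
    def load_data(self, obj_id):
        """Return the full data dictionary for one object."""

    @abstractmethod
    def write(self, obj_id, data, cache):
        """Persist an object's full data together with its cache."""

    def load_all_caches(self, obj_ids):
        # Default bulk load; a real backend could override this with one query.
        return {obj_id: self.load_cache(obj_id) for obj_id in obj_ids}


class InMemoryRepository(Repository):
    """Toy backend for illustration; nothing touches the filesystem."""

    def __init__(self):
        self._data = {}
        self._caches = {}

    def load_cache(self, obj_id):
        return dict(self._caches[obj_id])

    def load_data(self, obj_id):
        return dict(self._data[obj_id])

    def write(self, obj_id, data, cache):
        self._data[obj_id] = dict(data)
        self._caches[obj_id] = dict(cache)


repo = InMemoryRepository()
repo.write(1, {'status': 'new', 'name': 'a'}, {'status': 'new'})
repo.write(2, {'status': 'failed', 'name': 'b'}, {'status': 'failed'})
all_caches = repo.load_all_caches([1, 2])
```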

This isn't dissimilar to how it works today, but it should clarify the boundaries between the classes and allow some simplification of the code.

Display cache vs. index cache

There are two main reasons to perform lazy-loading:

  1. display a summary of pertinent information in a table (e.g. typing jobs)
  2. provide a small subset of information which is provided through the standard object API (e.g. looping over jobs and doing j.status, or fully loading a job without fully loading all its subjobs)

Currently the information for these is stored in the index cache and is sometimes confusingly mixed together and differentiated by mangling the names. The index cache is sometimes treated as a fall-through surrogate for the _data dictionary, even in those cases where the types do not match. This proposal suggests that the cache is always a fundamental Python object whose interpretation is solely down to the registry in question and is not implicitly treated as a simple subset of the _data dictionary.

The JobRegistry, for example, only needs to store information for the display function and could therefore have a cache something like:

{
    'status': 'running',
    'name': '',
    'subjobs': 10,
    'application': 'Executable',
    'backend': 'Dirac',
    'backend.actualCE': 'LCG.RAL-LCG2.uk',
    'comment': '',
}

There is no implicit mapping between the keys in this dictionary and schema attributes on the associated RegistryObject; it is simply information used for displaying jobs. Given a call of jobs.get_attribute(j.id, 'backend'), the registry would not have to use the cached value of 'backend'. However, since this is a generic cache of data, JobRegistry would be within its rights to use 'status' to give the return value of jobs.get_attribute(j.id, 'status'). It is up to the registry how it uses this data.
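The distinction can be illustrated with a toy registry that explicitly lists which cache keys it is willing to serve as attribute values. ToyJobRegistry, its constructor arguments and the shape of the full data are all assumptions made for this sketch.

```python
class ToyJobRegistry:
    # Cache keys this registry has chosen to trust as attribute values.
    _SERVE_FROM_CACHE = {'status', 'name', 'comment'}

    def __init__(self, caches, load_full):
        self._caches = caches        # object metadata cache, keyed by job id
        self._load_full = load_full  # hypothetical callable doing the full load
        self._data = {}

    def get_attribute(self, job_id, name):
        if job_id not in self._data:
            if name in self._SERVE_FROM_CACHE:
                return self._caches[job_id][name]
            self._data[job_id] = self._load_full(job_id)
        return self._data[job_id][name]

    def display_row(self, job_id):
        # The display function uses the cache only, whatever the key types are.
        c = self._caches[job_id]
        return (job_id, c['status'], c['name'], c['subjobs'],
                c['application'], c['backend'], c['backend.actualCE'], c['comment'])


cache = {
    'status': 'running', 'name': '', 'subjobs': 10,
    'application': 'Executable', 'backend': 'Dirac',
    'backend.actualCE': 'LCG.RAL-LCG2.uk', 'comment': '',
}
loads = []


def load_full(job_id):
    loads.append(job_id)
    # In the full data, 'backend' is a structured value, not a display string.
    return {'status': 'running', 'name': '', 'comment': '',
            'backend': {'type': 'Dirac', 'actualCE': 'LCG.RAL-LCG2.uk'}}


jobs = ToyJobRegistry({3: cache}, load_full)
status = jobs.get_attribute(3, 'status')    # answered from the cache
loads_after_status = len(loads)
backend = jobs.get_attribute(3, 'backend')  # cache value ignored, full load
```

Here the cached 'backend' string is used only by display_row(), while get_attribute() returns the structured value from the full data.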

Other registries may use this cache in other ways as they see fit, so we propose referring to this cache generically as an "object metadata cache". Its purpose is to (depending on the registry in question) get "some" information about an object without a potentially expensive full load of it.

Open questions

  1. The box registry: Since we've reduced the types that can be stored in registries, the box will need to be rewritten somewhat. It needs some thought but should not be allowed to prevent progress in other, more important areas.
  2. Subjobs: To first order this will continue to work as it does now, but it opens up possibilities in the future for harmonising the job registry and SubJobXMLList.