Store the model definition into files that are written (#358)
* Split model reading into file reading and parsing

* Add possibility to dump parsed EDMs to JSON

* Add registry for datamodel JSON definitions

- Generate model definitions in JSON format as constexpr string literals
- Register constexpr string literals in registry

* Populate definition registry via static variable initialization

- Also make the collections query-able

* Write definition and read it back afterwards (ROOT)

* Read and write EDM definitions in SIO

* Refactor EDM definition I/O functionality for less code duplication

* Add first version of model dumping

* Make dumped models look more like inputs

- Change order of main keys (options, components, datatypes)
- Slightly tweak formatting (as far as possible with PyYAML)

* Add roundtrip tests for stored EDM definitions

* Add documentation for EDM definition embedding

* Fix documentation

* Add warnings output when trying to retrieve non-existent EDMs

* Rename EDMDefinitionRegistry and clarify documentation

* Fix python bindings and rename tests

* Make utility classes members instead of using them as mixin

* Rename header file to match new terminology

* Update documentation to match implementation again

* Update test names also in ignored tests list
tmadlener authored Mar 7, 2023
1 parent ada8443 commit dc9b6ba
Showing 36 changed files with 951 additions and 51 deletions.
123 changes: 123 additions & 0 deletions doc/advanced_topics.md
Expand Up @@ -110,3 +110,126 @@ To implement your own transient event store, the only requirement is to set the
- Run pre-commit manually

`$ pre-commit run --all-files`

## Retrieving the EDM definition from a data file
It is possible to retrieve the EDM definition(s) that were used to generate the
datatypes stored in a data file. This makes it possible to re-generate the
necessary code and build all libraries again in case they are not easily
available otherwise. To see which EDM definitions are available in a data file,
use the `podio-dump` utility:

```bash
podio-dump <data-file>
```
which produces output similar to this:
```
input file: <data-file>
EDM model definitions stored in this file: edm4hep
[...]
```

To actually dump the model definition to stdout use the `--dump-edm` option
and the name of the datamodel you want to dump:

```bash
podio-dump --dump-edm edm4hep <data-file> > dumped_edm4hep.yaml
```

Here we directly redirected the output to a YAML file that can then again be
used by the `podio_class_generator.py` to generate the corresponding C++ code
(or be passed to the CMake macros).

**Note that the dumped EDM definition is equivalent to, but not necessarily
identical to, the original EDM definition.** E.g. all the datatypes will have all
their fields (`Members`, `OneToOneRelations`, `OneToManyRelations`,
`VectorMembers`) defined, and defaulted to empty lists in case they were not
present in the original EDM definition. The reason for this is that the embedded
EDM definition is the pre-processed and validated one, [as described
below](#technical-details-on-edm-definition-embedding).

### Accessing the EDM definition programmatically
The EDM definition can also be accessed programmatically via the
`[ROOT|SIO]FrameReader::getDatamodelDefinition` method. It takes an EDM name as
its single argument and returns the EDM definition as a JSON string. Most likely
this has to be decoded into an actual JSON structure in order to be usable (e.g.
via `json.loads` in Python to get a `dict`).
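
As a rough illustration, the following sketch collects all definitions a reader
knows about into a map. It only relies on the two reader methods
`getDatamodelDefinition` and `getAvailableDatamodels` as they are declared in
`ROOTFrameReader.h`. `MockFrameReader`, `collectDefinitions` and
`exampleUsage` are hypothetical names introduced here so that the snippet
compiles without podio, and the `"{}"` fallback for unknown names is an
assumption of the mock, not confirmed reader behavior.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a Frame reader so that the sketch compiles without
// podio. Only the two method names mirror the reader interface.
struct MockFrameReader {
  std::map<std::string, std::string> storedDefinitions{};

  std::string_view getDatamodelDefinition(const std::string& name) const {
    const auto it = storedDefinitions.find(name);
    if (it == storedDefinitions.end()) {
      return "{}"; // assumption: unknown names yield an empty JSON object
    }
    return it->second;
  }

  std::vector<std::string> getAvailableDatamodels() const {
    std::vector<std::string> names;
    for (const auto& entry : storedDefinitions) {
      names.push_back(entry.first);
    }
    return names;
  }
};

// Copy every definition the reader knows about into a map keyed by datamodel
// name. The strings are owned by the map, so they stay valid without the reader.
template <typename ReaderT>
std::map<std::string, std::string> collectDefinitions(const ReaderT& reader) {
  std::map<std::string, std::string> definitions;
  for (const auto& name : reader.getAvailableDatamodels()) {
    definitions.emplace(name, std::string(reader.getDatamodelDefinition(name)));
  }
  return definitions;
}

void exampleUsage() {
  MockFrameReader reader;
  reader.storedDefinitions["edm4hep"] = R"({"options": {}, "components": {}, "datatypes": {}})";

  const auto definitions = collectDefinitions(reader);
  assert(definitions.size() == 1);
  assert(definitions.at("edm4hep") == reader.getDatamodelDefinition("edm4hep"));
}
```

Since `collectDefinitions` is a template, it should in principle work with any
reader type that offers these two methods. The returned strings still have to
be parsed with a JSON library of your choice.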

### Technical details on EDM definition embedding
The EDM definition is embedded into the core EDM library as a raw string literal
in JSON format. This string is generated into the `DatamodelDefinition.h` file as

```cpp
namespace <package_name>::meta {
static constexpr auto <package_name>__JSONDefinition = R"EDMDEFINITION(<json encoded definition>)EDMDEFINITION";
}
```
where `<package_name>` is the name of the EDM as passed to the
`podio_class_generator.py` (or the CMake macro). The `<json encoded definition>`
is obtained from the pre-processed EDM definition that is read from the YAML
file. During this pre-processing the EDM definition is validated, and optional
fields are filled with empty defaults. Additionally, the `includeSubfolder`
option will be populated with the actual include subfolder, in case it has been
set to `True` in the YAML file. Since the JSON-encoded definition is generated
right before the pre-processed model is passed to the class generator, this
definition is equivalent, but not necessarily identical, to the original
definition.
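
To make the embedding mechanism more concrete, here is a minimal sketch of what
such a generated constant could look like for a hypothetical package called
`demo`. The JSON content is made up and heavily shortened. The `static_assert`s
are only there to show that the literal is available at compile time.

```cpp
#include <string_view>

namespace demo::meta {
// Hypothetical, heavily shortened definition. The custom delimiter of the raw
// string literal lets the JSON contain quotes and parentheses without escaping.
static constexpr auto demo__JSONDefinition =
    R"EDMDEFINITION({"options": {"getSyntax": true}, "components": {}, "datatypes": {}})EDMDEFINITION";
} // namespace demo::meta

// The literal is usable at compile time, e.g. through a std::string_view
constexpr std::string_view definition = demo::meta::demo__JSONDefinition;
static_assert(definition.front() == '{' && definition.back() == '}');
static_assert(definition.find("\"options\"") != std::string_view::npos);
```

Because the constant is a plain string literal, it ends up in the read-only
data of the compiled library without any runtime initialization.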

#### The `DatamodelRegistry`
To make access to information about currently loaded and available datamodels a
bit easier, the `DatamodelRegistry` (singleton) keeps a map of all loaded
datamodels and provides access to this information. In this context we
refer to an *EDM* as the shared library (and the corresponding public headers)
that has been compiled from code that has been generated from a *datamodel
definition* in the original YAML file. In general, whenever we refer to a
*datamodel* in this context we mean the entity as a whole, i.e. its definition
in a YAML file, the concrete implementation as an EDM, as well as all other
information that is related to it.

Currently the `DatamodelRegistry` mainly provides access to the original
definition of available datamodels via two methods:
```cpp
const std::string_view getDatamodelDefinition(std::string_view name) const;
const std::string_view getDatamodelDefinition(size_t index) const;
```

where `index` can be obtained from each collection via
`getDatamodelRegistryIndex`. That in turn simply calls
`<package_name>::meta::DatamodelRegistryIndex::value()`, another singleton-like
object that takes care of registering an EDM definition with the
`DatamodelRegistry` during its static initialization. It is also defined in the
`DatamodelDefinition.h` header.
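
The registration pattern described above can be sketched as follows. This is a
simplified, self-contained illustration and not the generated podio code:
`MiniRegistry`, the `demo` package, `registrationTrigger` and `exampleUsage`
are hypothetical names, and the real registry and generated helper may well be
implemented differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Simplified stand-in for the registry (not the podio class itself)
class MiniRegistry {
public:
  static MiniRegistry& instance() {
    static MiniRegistry registry; // created on first use
    return registry;
  }

  // Register a definition and return its index. A name that is already known
  // keeps its existing index.
  std::size_t registerDatamodel(std::string name, std::string_view definition) {
    for (std::size_t i = 0; i < m_definitions.size(); ++i) {
      if (m_definitions[i].first == name) {
        return i;
      }
    }
    m_definitions.emplace_back(std::move(name), definition);
    return m_definitions.size() - 1;
  }

  // Unknown indices yield an empty JSON object
  std::string_view getDatamodelDefinition(std::size_t index) const {
    return index < m_definitions.size() ? m_definitions[index].second : std::string_view("{}");
  }

private:
  MiniRegistry() = default;
  std::vector<std::pair<std::string, std::string_view>> m_definitions{};
};

namespace demo::meta {
static constexpr auto demo__JSONDefinition = R"EDMDEFINITION({"options": {}})EDMDEFINITION";

// Singleton-like helper: the registration runs exactly once, when the
// function-local static is initialized on the first call to value()
class DatamodelRegistryIndex {
public:
  static std::size_t value() {
    static const auto index = MiniRegistry::instance().registerDatamodel("demo", demo__JSONDefinition);
    return index;
  }
};
} // namespace demo::meta

// Forces the registration during static initialization, i.e. when the
// translation unit (or the shared library containing it) is loaded
[[maybe_unused]] static const auto registrationTrigger = demo::meta::DatamodelRegistryIndex::value();

void exampleUsage() {
  const auto index = demo::meta::DatamodelRegistryIndex::value();
  assert(MiniRegistry::instance().getDatamodelDefinition(index) == demo::meta::demo__JSONDefinition);
}
```

Using function-local statics for both the registry and the index avoids
depending on the initialization order of globals across translation units,
which is one plausible reason to structure the registration this way.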

Since the datamodel definition is embedded as a raw string literal into the core
EDM shared library, it is in principle also relatively straightforward to
retrieve it from this library by inspecting the binary, e.g. via
```bash
readelf -p .rodata libedm4hep.so | grep options
```

which will result in something like

```
[ 300] {"options": {"getSyntax": true, "exposePODMembers": false, "includeSubfolder": "edm4hep/"}, "components": {<...>}, "datatypes": {<...>}}
```

#### I/O helpers for EDM definition storing
The `podio/utilities/DatamodelRegistryIOHelpers.h` header defines two utility
classes that help with instrumenting readers and writers with functionality to
read and write all the necessary EDM definitions.

- The `DatamodelDefinitionCollector` is intended for usage in writers. It
  essentially collects the datamodel definitions of all the collections it
  encounters. The `registerDatamodelDefinition` method it provides should be
  called with every collection that is written. The
  `getDatamodelDefinitionsToWrite` method returns a vector of all datamodel
  names and their definitions that were encountered during writing. **It is
  then the writer's responsibility to actually store this information into the
  file**.
- The `DatamodelDefinitionHolder` is intended to be used by readers. It
  provides the `getDatamodelDefinition` and `getAvailableDatamodels` methods.
  **It is again the reader's responsibility to correctly populate it with the
  data it has read from file.** Currently the `SIOFrameReader` and the
  `ROOTFrameReader` use it and expose the same functionality as public
  methods.
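
The division of labor between the two helpers can be illustrated with a small,
self-contained sketch. All types here (`CollectorSketch`, `HolderSketch`,
`MockCollection`, `registryTable`) are hypothetical stand-ins that only borrow
the method names from the description above. The actual podio classes, their
signatures and their handling of unknown names may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

using DefinitionList = std::vector<std::tuple<std::string, std::string>>;

// Hypothetical stand-in for the registry: a fixed table of (name, definition)
inline const std::vector<std::pair<std::string, std::string>>& registryTable() {
  static const std::vector<std::pair<std::string, std::string>> table = {
      {"edmA", R"({"datatypes": {"A": {}}})"}, {"edmB", R"({"datatypes": {"B": {}}})"}};
  return table;
}

// Hypothetical collection type that only knows its registry index
struct MockCollection {
  std::size_t registryIndex;
  std::size_t getDatamodelRegistryIndex() const { return registryIndex; }
};

// Writer side: remember which datamodels were seen, without duplicates
class CollectorSketch {
public:
  void registerDatamodelDefinition(const MockCollection& coll) {
    m_indices.insert(coll.getDatamodelRegistryIndex());
  }

  DefinitionList getDatamodelDefinitionsToWrite() const {
    DefinitionList result;
    for (const auto index : m_indices) {
      if (index < registryTable().size()) { // skip collections without a definition
        result.emplace_back(registryTable()[index].first, registryTable()[index].second);
      }
    }
    return result;
  }

private:
  std::set<std::size_t> m_indices{};
};

// Reader side: gives access to whatever the reader found in the file
class HolderSketch {
public:
  explicit HolderSketch(DefinitionList definitions) : m_definitions(std::move(definitions)) {}

  std::string getDatamodelDefinition(const std::string& name) const {
    for (const auto& [edmName, definition] : m_definitions) {
      if (edmName == name) return definition;
    }
    return "{}"; // assumption: empty JSON object for unknown names
  }

  std::vector<std::string> getAvailableDatamodels() const {
    std::vector<std::string> names;
    for (const auto& entry : m_definitions) names.push_back(std::get<0>(entry));
    return names;
  }

private:
  DefinitionList m_definitions{};
};

void exampleRoundTrip() {
  CollectorSketch collector;
  collector.registerDatamodelDefinition(MockCollection{1});
  collector.registerDatamodelDefinition(MockCollection{1}); // same datamodel again
  collector.registerDatamodelDefinition(MockCollection{static_cast<std::size_t>(-1)}); // no definition

  // A real writer would store this list in the file and a reader would read it back
  const HolderSketch holder(collector.getDatamodelDefinitionsToWrite());
  assert(holder.getAvailableDatamodels() == std::vector<std::string>{"edmB"});
  assert(holder.getDatamodelDefinition("edmB") == registryTable()[1].second);
}
```

In this sketch the list returned by the collector is handed directly to the
holder. In an actual reader and writer pair, serializing that list to the file
and reading it back is the part each backend has to implement itself.
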
3 changes: 3 additions & 0 deletions include/podio/CollectionBase.h
Expand Up @@ -76,6 +76,9 @@ class CollectionBase {

/// print this collection to the passed stream
virtual void print(std::ostream& os = std::cout, bool flush = true) const = 0;

/// Get the index in the DatamodelRegistry of the EDM this collection belongs to
virtual size_t getDatamodelRegistryIndex() const = 0;
};

} // namespace podio
99 changes: 99 additions & 0 deletions include/podio/DatamodelRegistry.h
@@ -0,0 +1,99 @@
#ifndef PODIO_DATAMODELREGISTRY_H
#define PODIO_DATAMODELREGISTRY_H

#include <string>
#include <string_view>
#include <utility>
#include <vector>

namespace podio {

/**
* Global registry holding information about datamodels and datatypes defined
* therein that are currently known by podio (i.e. which have been dynamically
* loaded).
*
* This is a singleton which is (statically) populated during dynamic loading of
* generated EDMs. In this context an **EDM refers to the shared library** that
* is compiled from the generated code from a datamodel definition in YAML
* format. When we refer to a **datamodel** in this context we talk about the
* entity as a whole, i.e. its definition in a YAML file, but also the concrete
* implementation as an EDM, as well as all other information that is related to
* it. In the API of this registry this will be used, unless we want to
* highlight that we are referring to a specific part of a datamodel.
*/
class DatamodelRegistry {
public:
/// Get the registry
static const DatamodelRegistry& instance();

// Mutable instance only used for the initial registration!
static DatamodelRegistry& mutInstance();

~DatamodelRegistry() = default;
DatamodelRegistry(const DatamodelRegistry&) = delete;
DatamodelRegistry& operator=(const DatamodelRegistry&) = delete;
DatamodelRegistry(DatamodelRegistry&&) = delete;
DatamodelRegistry& operator=(DatamodelRegistry&&) = delete;

/// Dedicated index value for collections that don't have a datamodel
/// definition (e.g. UserDataCollection)
static constexpr size_t NoDefinitionNecessary = -1;
/// Dedicated index value for error checking, used to default init the generated RegistryIndex
static constexpr size_t NoDefinitionAvailable = -2;

/**
* Get the definition (in JSON format) of the datamodel with the given
* name.
*
* If no datamodel with the given name can be found, an empty datamodel
* definition, i.e. an empty JSON object ("{}"), is returned.
*
* @param name The name of the datamodel
*/
const std::string_view getDatamodelDefinition(std::string_view name) const;

/**
* Get the definition (in JSON format) of the datamodel with the given index.
*
* If no datamodel is found under the given index, an empty datamodel
* definition, i.e. an empty JSON object ("{}"), is returned.
*
* @param index The datamodel definition index that can be obtained from each
* collection
*/
const std::string_view getDatamodelDefinition(size_t index) const;

/**
* Get the name of the datamodel that is stored under the given index.
*
* If no datamodel is found under the given index, an empty string is returned
*
* @param index The datamodel definition index that can be obtained from each
* collection
*/
const std::string& getDatamodelName(size_t index) const;

/**
* Register a datamodel and return its index in the registry.
*
* This is the hook that is called during dynamic loading of an EDM to
* register information for this EDM. If an EDM has already been registered
* under this name, then the index to the existing EDM in the registry will be
* returned.
*
* @param name The name of the EDM that should be registered
* @param definition The datamodel definition from which this EDM has been
* generated in JSON format
*
*/
size_t registerDatamodel(std::string name, std::string_view definition);

private:
DatamodelRegistry() = default;
/// The stored definitions
std::vector<std::pair<std::string, std::string_view>> m_definitions{};
};
} // namespace podio

#endif // PODIO_DATAMODELREGISTRY_H
12 changes: 12 additions & 0 deletions include/podio/ROOTFrameReader.h
Expand Up @@ -4,6 +4,7 @@
#include "podio/CollectionBranches.h"
#include "podio/ROOTFrameData.h"
#include "podio/podioVersion.h"
#include "podio/utilities/DatamodelRegistryIOHelpers.h"

#include "TChain.h"

Expand Down Expand Up @@ -79,6 +80,16 @@ class ROOTFrameReader {
/// Get the names of all the available Frame categories in the current file(s)
std::vector<std::string_view> getAvailableCategories() const;

/// Get the datamodel definition for the given name
const std::string_view getDatamodelDefinition(const std::string& name) const {
return m_datamodelHolder.getDatamodelDefinition(name);
}

/// Get all names of the datamodels that are available from this reader
std::vector<std::string> getAvailableDatamodels() const {
return m_datamodelHolder.getAvailableDatamodels();
}

private:
/**
* Helper struct to group together all the necessary state to read / process a
Expand Down Expand Up @@ -132,6 +143,7 @@ class ROOTFrameReader {
std::vector<std::string> m_availCategories{}; ///< All available categories from this file

podio::version::Version m_fileVersion{0, 0, 0};
DatamodelDefinitionHolder m_datamodelHolder{};
};

} // namespace podio
3 changes: 3 additions & 0 deletions include/podio/ROOTFrameWriter.h
Expand Up @@ -3,6 +3,7 @@

#include "podio/CollectionBranches.h"
#include "podio/CollectionIDTable.h"
#include "podio/utilities/DatamodelRegistryIOHelpers.h"

#include "TFile.h"

Expand Down Expand Up @@ -80,6 +81,8 @@ class ROOTFrameWriter {

std::unique_ptr<TFile> m_file{nullptr}; ///< The storage file
std::unordered_map<std::string, CategoryInfo> m_categories{}; ///< All categories

DatamodelDefinitionCollector m_datamodelCollector{};
};

} // namespace podio
59 changes: 59 additions & 0 deletions include/podio/SIOBlock.h
Expand Up @@ -6,6 +6,7 @@
#include <podio/EventStore.h>
#include <podio/GenericParameters.h>
#include <podio/podioVersion.h>
#include <podio/utilities/TypeHelpers.h>

#include <sio/block.h>
#include <sio/io_device.h>
Expand All @@ -16,6 +17,7 @@
#include <string>
#include <string_view>
#include <tuple>
#include <vector>

namespace podio {

Expand All @@ -26,6 +28,34 @@ void handlePODDataSIO(devT& device, PODData* data, size_t size) {
device.data(dataPtr, count);
}

/// Write anything that iterates like an std::map
template <typename MapLikeT>
void writeMapLike(sio::write_device& device, const MapLikeT& map) {
device.data((int)map.size());
for (const auto& [key, value] : map) {
device.data(key);
device.data(value);
}
}

/// Read anything that iterates like an std::map
template <typename MapLikeT>
void readMapLike(sio::read_device& device, MapLikeT& map) {
int size;
device.data(size);
while (size--) {
detail::GetKeyType<MapLikeT> key;
device.data(key);
detail::GetMappedType<MapLikeT> value;
device.data(value);
if constexpr (podio::detail::isVector<MapLikeT>) {
map.emplace_back(std::move(key), std::move(value));
} else {
map.emplace(std::move(key), std::move(value));
}
}
}

/// Base class for sio::block handlers used with PODIO
class SIOBlock : public sio::block {

Expand Down Expand Up @@ -141,6 +171,32 @@ class SIOEventMetaDataBlock : public sio::block {
podio::GenericParameters* metadata{nullptr};
};

/**
* A block to serialize anything that behaves similar in iterating as a
* map<KeyT, ValueT>, e.g. vector<tuple<KeyT, ValueT>>, which is what is used
* internally to represent the data to be written.
*/
template <typename KeyT, typename ValueT>
struct SIOMapBlock : public sio::block {
SIOMapBlock() : sio::block("SIOMapBlock", sio::version::encode_version(0, 1)) {
}
SIOMapBlock(std::vector<std::tuple<KeyT, ValueT>>&& data) :
sio::block("SIOMapBlock", sio::version::encode_version(0, 1)), mapData(std::move(data)) {
}

SIOMapBlock(const SIOMapBlock&) = delete;
SIOMapBlock& operator=(const SIOMapBlock&) = delete;

void read(sio::read_device& device, sio::version_type) override {
readMapLike(device, mapData);
}
void write(sio::write_device& device) override {
writeMapLike(device, mapData);
}

std::vector<std::tuple<KeyT, ValueT>> mapData{};
};

/**
* A block for handling the run and collection meta data
*/
Expand Down Expand Up @@ -219,6 +275,9 @@ namespace sio_helpers {
/// The name of the TOCRecord
static constexpr const char* SIOTocRecordName = "podio_SIO_TOC_Record";

/// The name of the record containing the EDM definitions in json format
static constexpr const char* SIOEDMDefinitionName = "podio_SIO_EDMDefinitions";

// should hopefully be enough for all practical purposes
using position_type = uint32_t;
} // namespace sio_helpers