Store the model definition into files that are written #358

Merged 19 commits on Mar 7, 2023
123 changes: 123 additions & 0 deletions doc/advanced_topics.md
@@ -110,3 +110,126 @@ To implement your own transient event store, the only requirement is to set the
- Run pre-commit manually

`$ pre-commit run --all-files`

## Retrieving the EDM definition from a data file
It is possible to retrieve the EDM definition(s) that were used to generate the
datatypes stored in a data file. This makes it possible to re-generate the
necessary code and rebuild all libraries in case they are not easily available
otherwise. To see which EDM definitions are available in a data file, use the
`podio-dump` utility:

```bash
podio-dump <data-file>
```
which produces output like the following (example):
```
input file: <data-file>

EDM model definitions stored in this file: edm4hep

[...]
```

To actually dump the model definition to stdout use the `--dump-edm` option
and the name of the datamodel you want to dump:

```bash
podio-dump --dump-edm edm4hep <data-file> > dumped_edm4hep.yaml
```

Here the output is redirected directly to a YAML file that can then be
used by `podio_class_generator.py` to generate the corresponding C++ code
(or be passed to the CMake macros).

**Note that the dumped EDM definition is equivalent to, but not necessarily
identical to, the original EDM definition.** For example, all datatypes will have
all their fields (`Members`, `OneToOneRelations`, `OneToManyRelations`,
`VectorMembers`) defined, defaulted to empty lists if they were not
present in the original EDM definition. The reason is that the embedded
EDM definition is the pre-processed and validated one, [as described
below](#technical-details-on-edm-definition-embedding).

### Accessing the EDM definition programmatically
The EDM definition can also be accessed programmatically via the
`[ROOT|SIO]FrameReader::getDatamodelDefinition` method. It takes an EDM name as its
single argument and returns the EDM definition as a JSON string. Most likely
this has to be decoded into an actual JSON structure in order to be usable (e.g.
via `json.loads` in Python to get a `dict`).

### Technical details on EDM definition embedding
The EDM definition is embedded into the core EDM library as a raw string literal
in JSON format. This string is generated into the `DatamodelDefinition.h` file as

```cpp
namespace <package_name>::meta {
static constexpr auto <package_name>__JSONDefinition = R"EDMDEFINITION(<json encoded definition>)EDMDEFINITION";
}
```

where `<package_name>` is the name of the EDM as passed to
`podio_class_generator.py` (or the CMake macro). The `<json encoded definition>`
is obtained from the pre-processed EDM definition that is read from the YAML
file. During this pre-processing the EDM definition is validated, and optional
fields are filled with empty defaults. Additionally, the `includeSubfolder`
option is populated with the actual include subfolder if it has been
set to `True` in the YAML file. Since the JSON-encoded definition is generated
right before the pre-processed model is passed to the class generator, this
definition is equivalent, but not necessarily identical, to the original definition.
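
As a rough, self-contained illustration of the embedding mechanism (not the generated header itself), the sketch below uses a hypothetical package name `demo` and a made-up, heavily abbreviated JSON definition. It only shows how a raw string literal with a custom delimiter can hold JSON without escaping the quotes, and how it can be viewed as a `std::string_view` without copying. The helper `demoDefinition` is an assumption for illustration and is not generated by podio.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical package name "demo"; the JSON content is a made-up, abbreviated
// stand-in for what the generator would embed.
namespace demo::meta {
static constexpr auto demo__JSONDefinition =
    R"EDMDEFINITION({"options": {"getSyntax": true}, "components": {}, "datatypes": {}})EDMDEFINITION";
} // namespace demo::meta

// Illustrative helper (not part of podio): view the embedded literal as a
// std::string_view. The literal has static storage duration, so the view
// stays valid for the lifetime of the program.
inline std::string_view demoDefinition() {
  constexpr std::string_view def = demo::meta::demo__JSONDefinition;
  // Sanity check of the sketch's own assumption: the definition is a JSON object
  assert(!def.empty() && def.front() == '{' && def.back() == '}');
  return def;
}
```

The custom delimiter `EDMDEFINITION` means the literal only ends at the exact sequence `)EDMDEFINITION"`, so parentheses and quotes inside the JSON need no escaping.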

#### The `DatamodelRegistry`
To make it easier to access information about the currently loaded and available
datamodels, the `DatamodelRegistry` (a singleton) keeps a map of all loaded
datamodels and provides access to this information. In this context we
refer to an *EDM* as the shared library (and the corresponding public headers)
that has been compiled from code generated from a *datamodel
definition* in the original YAML file. Whenever we refer to a
*datamodel* in this context we mean the entity as a whole, i.e. its definition
in a YAML file, the concrete implementation as an EDM, and any other
information related to it.

Currently the `DatamodelRegistry` mainly provides access to the original
definition of available datamodels via two methods:
```cpp
const std::string_view getDatamodelDefinition(std::string_view name) const;

const std::string_view getDatamodelDefinition(size_t index) const;
```

where `index` can be obtained from each collection via
`getDatamodelRegistryIndex`. That in turn simply calls
`<package_name>::meta::DatamodelRegistryIndex::value()`, another singleton-like object
that takes care of registering an EDM definition with the `DatamodelRegistry`
during its static initialization. It is also defined in the
`DatamodelDefinition.h` header.
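
The interplay between a registry and a per-EDM index object can be sketched roughly as follows. This is a simplified, self-contained stand-in and not podio's implementation: `SketchRegistry` and `SketchRegistryIndex` are hypothetical names, the datamodel name `demo` and its JSON are made up, and the sketch registers lazily through a function-local static, which may differ from how the generated code triggers registration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Simplified stand-in for a datamodel registry (hypothetical, not podio's class)
class SketchRegistry {
public:
  static SketchRegistry& instance() {
    static SketchRegistry registry; // constructed on first use
    return registry;
  }

  // Register a definition and return its index. Registering the same name
  // again returns the index of the existing entry.
  std::size_t registerDatamodel(std::string name, std::string_view definition) {
    assert(!name.empty());
    for (std::size_t i = 0; i < m_definitions.size(); ++i) {
      if (m_definitions[i].first == name) {
        return i;
      }
    }
    m_definitions.emplace_back(std::move(name), definition);
    return m_definitions.size() - 1;
  }

  // Unknown indices yield an empty JSON object
  std::string_view getDatamodelDefinition(std::size_t index) const {
    if (index >= m_definitions.size()) {
      return "{}";
    }
    return m_definitions[index].second;
  }

private:
  SketchRegistry() = default;
  std::vector<std::pair<std::string, std::string_view>> m_definitions{};
};

// Plays the role of a per-EDM index object: the function-local static is
// initialized exactly once, and that initialization performs the registration.
struct SketchRegistryIndex {
  static std::size_t value() {
    static const std::size_t index =
        SketchRegistry::instance().registerDatamodel("demo", R"({"datatypes": {}})");
    return index;
  }
};
```

Storing only a `std::string_view` in the registry works here because the registered definition is a string literal with static storage duration; a definition held in a temporary `std::string` would leave a dangling view.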

Since the datamodel definition is embedded as a raw string literal into the core
EDM shared library, it is in principle also relatively straightforward to
retrieve it from this library by inspecting the binary, e.g. via
```bash
readelf -p .rodata libedm4hep.so | grep options
```

which will result in something like

```
[ 300] {"options": {"getSyntax": true, "exposePODMembers": false, "includeSubfolder": "edm4hep/"}, "components": {<...>}, "datatypes": {<...>}}
```

#### I/O helpers for EDM definition storing
The `podio/utilities/DatamodelRegistryIOHelpers.h` header defines two utility
classes that help with instrumenting readers and writers with functionality to
read and write all the necessary EDM definitions.

- The `DatamodelDefinitionCollector` is intended for use in writers. It
essentially collects the datamodel definitions of all the collections it encounters.
The `registerDatamodelDefinition` method it provides should be called with every collection
that is written. The `getDatamodelDefinitionsToWrite` method returns a vector of all
datamodel names and their definitions that were encountered during writing. **It is
then the writer's responsibility to actually store this information in the
file**.
- The `DatamodelDefinitionHolder` is intended to be used by readers. It
provides the `getDatamodelDefinition` and `getAvailableDatamodels` methods.
**It is again the reader's responsibility to correctly populate it with the data it
has read from file.** Currently the `SIOFrameReader` and the `ROOTFrameReader`
use it and expose the same functionality through public methods of their own.
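
The collecting side described above can be sketched roughly as follows. Everything here is a hypothetical, simplified stand-in rather than podio's classes: `SketchCollection`, `sketchKnownDatamodels` and `SketchDefinitionCollector` are made-up names, the lookup table replaces the real registry, and silently skipping unknown indices is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a collection: it only knows its registry index
struct SketchCollection {
  std::size_t registryIndex;
};

// Plays the role of the registry lookup: index -> (name, JSON definition)
inline const std::vector<std::tuple<std::string, std::string>>& sketchKnownDatamodels() {
  static const std::vector<std::tuple<std::string, std::string>> known = {
      {"edm_a", R"({"datatypes": {"A": {}}})"},
      {"edm_b", R"({"datatypes": {"B": {}}})"},
  };
  return known;
}

class SketchDefinitionCollector {
public:
  // Call once per written collection; repeated indices collapse in the set.
  // Indices without a known definition are skipped (assumption of this sketch).
  void registerDatamodelDefinition(const SketchCollection& coll) {
    if (coll.registryIndex < sketchKnownDatamodels().size()) {
      m_seenIndices.insert(coll.registryIndex);
    }
  }

  // Name/definition pairs for every datamodel seen so far. Persisting them
  // into the output file is left to the writer.
  std::vector<std::tuple<std::string, std::string>> getDatamodelDefinitionsToWrite() const {
    std::vector<std::tuple<std::string, std::string>> result;
    result.reserve(m_seenIndices.size());
    for (const auto index : m_seenIndices) {
      assert(index < sketchKnownDatamodels().size());
      result.push_back(sketchKnownDatamodels()[index]);
    }
    return result;
  }

private:
  std::set<std::size_t> m_seenIndices{};
};
```

Tracking indices instead of full definitions keeps the per-collection call cheap; the definitions are only looked up once, when the writer asks for them.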
3 changes: 3 additions & 0 deletions include/podio/CollectionBase.h
@@ -76,6 +76,9 @@ class CollectionBase {

/// print this collection to the passed stream
virtual void print(std::ostream& os = std::cout, bool flush = true) const = 0;

/// Get the index in the DatamodelRegistry of the EDM this collection belongs to
virtual size_t getDatamodelRegistryIndex() const = 0;
};

} // namespace podio
99 changes: 99 additions & 0 deletions include/podio/DatamodelRegistry.h
@@ -0,0 +1,99 @@
#ifndef PODIO_DATAMODELREGISTRY_H
#define PODIO_DATAMODELREGISTRY_H

#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

namespace podio {

/**
* Global registry holding information about datamodels and datatypes defined
* therein that are currently known by podio (i.e. which have been dynamically
* loaded).
*
* This is a singleton which is (statically) populated during dynamic loading of
* generated EDMs. In this context an **EDM refers to the shared library** that
* is compiled from the generated code from a datamodel definition in YAML
* format. When we refer to a **datamodel** in this context we talk about the
* entity as a whole, i.e. its definition in a YAML file, but also the concrete
* implementation as an EDM, as well as all other information that is related to
* it. In the API of this registry the term datamodel is used, unless we want to
* highlight that we are referring to a specific part of a datamodel.
*/
class DatamodelRegistry {
public:
/// Get the registry
static const DatamodelRegistry& instance();

// Mutable instance only used for the initial registration!
static DatamodelRegistry& mutInstance();

~DatamodelRegistry() = default;
DatamodelRegistry(const DatamodelRegistry&) = delete;
DatamodelRegistry& operator=(const DatamodelRegistry&) = delete;
DatamodelRegistry(DatamodelRegistry&&) = delete;
DatamodelRegistry& operator=(DatamodelRegistry&&) = delete;

/// Dedicated index value for collections that don't have a datamodel
/// definition (e.g. UserDataCollection)
static constexpr size_t NoDefinitionNecessary = -1;
/// Dedicated index value for error checking, used to default init the generated RegistryIndex
static constexpr size_t NoDefinitionAvailable = -2;

/**
* Get the definition (in JSON format) of the datamodel with the given
* name.
*
* If no datamodel with the given name can be found, an empty datamodel
* definition, i.e. an empty JSON object ("{}"), is returned.
*
* @param name The name of the datamodel
*/
const std::string_view getDatamodelDefinition(std::string_view name) const;

/**
* Get the definition (in JSON format) of the datamodel with the given index.
*
* If no datamodel is found under the given index, an empty datamodel
* definition, i.e. an empty JSON object ("{}"), is returned.
*
* @param index The datamodel definition index that can be obtained from each
* collection
*/
const std::string_view getDatamodelDefinition(size_t index) const;

/**
* Get the name of the datamodel that is stored under the given index.
*
* If no datamodel is found under the given index, an empty string is returned.
*
* @param index The datamodel definition index that can be obtained from each
* collection
*/
const std::string& getDatamodelName(size_t index) const;

/**
* Register a datamodel and return its index in the registry.
*
* This is the hook that is called during dynamic loading of an EDM to
* register information for this EDM. If an EDM has already been registered
* under this name, then the index of the existing EDM in the registry will be
* returned.
*
* @param name The name of the EDM that should be registered
* @param definition The datamodel definition from which this EDM has been
* generated in JSON format
*
*/
size_t registerDatamodel(std::string name, std::string_view definition);

private:
DatamodelRegistry() = default;
/// The stored definitions
std::vector<std::pair<std::string, std::string_view>> m_definitions{};
};
} // namespace podio

#endif // PODIO_DATAMODELREGISTRY_H
12 changes: 12 additions & 0 deletions include/podio/ROOTFrameReader.h
@@ -4,6 +4,7 @@
#include "podio/CollectionBranches.h"
#include "podio/ROOTFrameData.h"
#include "podio/podioVersion.h"
#include "podio/utilities/DatamodelRegistryIOHelpers.h"

#include "TChain.h"

@@ -79,6 +80,16 @@ class ROOTFrameReader {
/// Get the names of all the available Frame categories in the current file(s)
std::vector<std::string_view> getAvailableCategories() const;

/// Get the datamodel definition for the given name
const std::string_view getDatamodelDefinition(const std::string& name) const {
return m_datamodelHolder.getDatamodelDefinition(name);
}

/// Get the names of all datamodels that are available from this reader
std::vector<std::string> getAvailableDatamodels() const {
return m_datamodelHolder.getAvailableDatamodels();
}

private:
/**
* Helper struct to group together all the necessary state to read / process a
@@ -132,6 +143,7 @@ class ROOTFrameReader {
std::vector<std::string> m_availCategories{}; ///< All available categories from this file

podio::version::Version m_fileVersion{0, 0, 0};
DatamodelDefinitionHolder m_datamodelHolder{};
};

} // namespace podio
3 changes: 3 additions & 0 deletions include/podio/ROOTFrameWriter.h
@@ -3,6 +3,7 @@

#include "podio/CollectionBranches.h"
#include "podio/CollectionIDTable.h"
#include "podio/utilities/DatamodelRegistryIOHelpers.h"

#include "TFile.h"

@@ -80,6 +81,8 @@ class ROOTFrameWriter {

std::unique_ptr<TFile> m_file{nullptr}; ///< The storage file
std::unordered_map<std::string, CategoryInfo> m_categories{}; ///< All categories

DatamodelDefinitionCollector m_datamodelCollector{};
};

} // namespace podio
59 changes: 59 additions & 0 deletions include/podio/SIOBlock.h
@@ -6,6 +6,7 @@
#include <podio/EventStore.h>
#include <podio/GenericParameters.h>
#include <podio/podioVersion.h>
#include <podio/utilities/TypeHelpers.h>

#include <sio/block.h>
#include <sio/io_device.h>
@@ -16,6 +17,7 @@
#include <string>
#include <string_view>
#include <tuple>
#include <vector>

namespace podio {

@@ -26,6 +28,34 @@ void handlePODDataSIO(devT& device, PODData* data, size_t size) {
device.data(dataPtr, count);
}

/// Write anything that iterates like an std::map
template <typename MapLikeT>
void writeMapLike(sio::write_device& device, const MapLikeT& map) {
device.data(static_cast<int>(map.size()));
for (const auto& [key, value] : map) {
device.data(key);
device.data(value);
}
}

/// Read anything that iterates like an std::map
template <typename MapLikeT>
void readMapLike(sio::read_device& device, MapLikeT& map) {
int size;
device.data(size);
while (size--) {
detail::GetKeyType<MapLikeT> key;
device.data(key);
detail::GetMappedType<MapLikeT> value;
device.data(value);
if constexpr (podio::detail::isVector<MapLikeT>) {
map.emplace_back(std::move(key), std::move(value));
} else {
map.emplace(std::move(key), std::move(value));
}
}
}

/// Base class for sio::block handlers used with PODIO
class SIOBlock : public sio::block {

@@ -141,6 +171,32 @@ class SIOEventMetaDataBlock : public sio::block {
podio::GenericParameters* metadata{nullptr};
};

/**
* A block to serialize anything that iterates like a
* map<KeyT, ValueT>, e.g. vector<tuple<KeyT, ValueT>>, which is what is used
* internally to represent the data to be written.
*/
template <typename KeyT, typename ValueT>
struct SIOMapBlock : public sio::block {
SIOMapBlock() : sio::block("SIOMapBlock", sio::version::encode_version(0, 1)) {
}
SIOMapBlock(std::vector<std::tuple<KeyT, ValueT>>&& data) :
sio::block("SIOMapBlock", sio::version::encode_version(0, 1)), mapData(std::move(data)) {
}

SIOMapBlock(const SIOMapBlock&) = delete;
SIOMapBlock& operator=(const SIOMapBlock&) = delete;

void read(sio::read_device& device, sio::version_type) override {
readMapLike(device, mapData);
}
void write(sio::write_device& device) override {
writeMapLike(device, mapData);
}

std::vector<std::tuple<KeyT, ValueT>> mapData{};
};

/**
* A block for handling the run and collection meta data
*/
@@ -219,6 +275,9 @@ namespace sio_helpers {
/// The name of the TOCRecord
static constexpr const char* SIOTocRecordName = "podio_SIO_TOC_Record";

/// The name of the record containing the EDM definitions in json format
static constexpr const char* SIOEDMDefinitionName = "podio_SIO_EDMDefinitions";

// should hopefully be enough for all practical purposes
using position_type = uint32_t;
} // namespace sio_helpers