Columnar data docs - native data interface serialization #788

Merged
merged 6 commits into from
Oct 18, 2024
47 changes: 44 additions & 3 deletions docs/advanced_documentation/native-data-interface.md
@@ -55,21 +55,57 @@ node_dtype = np.dtype(
To recreate the same node input dataset, we just create a `numpy` array using this specially defined `dtype`.
The `numpy` array has exactly the same data layout as the `std::vector<NodeInput>` above.


```python
node = np.empty(shape=2, dtype=node_dtype)
node['id'] = [1, 2]
node['u_rated'] = [150e3, 10e3]
```
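
The layout claim above can be checked from the dtype itself. The following is a minimal sketch, assuming `node_dtype` has a 4-byte `id` at offset 0 and an 8-byte `u_rated` at offset 8 in a 16-byte element; the actual definition lies outside this excerpt, so these offsets are an assumption.

```python
import numpy as np

# Assumed definition of node_dtype (the real one is given earlier in the document).
node_dtype = np.dtype(
    {
        "names": ["id", "u_rated"],
        "formats": ["<i4", "<f8"],
        "offsets": [0, 8],
        "itemsize": 16,
    }
)

node = np.zeros(shape=2, dtype=node_dtype)
node["id"] = [1, 2]
node["u_rated"] = [150e3, 10e3]

# Byte offset of each attribute inside one element.
offsets = {name: node_dtype.fields[name][1] for name in node_dtype.names}
print(offsets)
# Bytes per node, including padding between id and u_rated.
print(node_dtype.itemsize)
# Total size of the buffer for two nodes.
print(node.nbytes)
```

If the offsets and element size match those of the C++ struct, the buffer of the array can be read as a contiguous sequence of `NodeInput` elements.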

## Columnar data format

Additionally, we can represent only specific attributes of the `NodeInput` struct described in [Structured Array](#structured-array), using one array per attribute.
This is especially useful when the component in question, e.g., a transformer, has many attributes with default values. Providing only the attributes that are needed can save a significant amount of memory.
For example, the `u_rated` attribute on its own can be represented by the type `NodeInputURated`, which is a `double`
(note again that its representation in the C++ core might differ from `NodeInputURated`).

One can create a `std::vector<NodeInputURated>` to hold the input for multiple nodes.
As in the example above, we create `u_rated` attribute data for two nodes of 150 kV and 10 kV.

```c++
using NodeInputURated = double;
std::vector<NodeInputURated> node_u_rated_input{ 150.0e3 , 10.0e3 };
```

The same applies to `NodeInputId` and `std::vector<NodeInputId>`.

To recreate this in Python using NumPy arrays, we create one array per attribute, each with the dtype of that attribute as defined in [Structured Array](#structured-array).

```python
node_id = np.empty(shape=2, dtype=node_dtype["id"])
# node_id is a plain (non-structured) array, so values are assigned directly
node_id[:] = [1, 2]
node_u_rated = np.empty(shape=2, dtype=node_dtype["u_rated"])
node_u_rated[:] = [150e3, 10e3]
```
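
The memory saving can be made concrete by comparing buffer sizes. The sketch below again assumes a 16-byte `node_dtype` with `id` at offset 0 and `u_rated` at offset 8 (its definition is outside this excerpt), and uses an arbitrary count of 1000 nodes for illustration.

```python
import numpy as np

# Assumed definition of node_dtype (the real one is given earlier in the document).
node_dtype = np.dtype(
    {
        "names": ["id", "u_rated"],
        "formats": ["<i4", "<f8"],
        "offsets": [0, 8],
        "itemsize": 16,
    }
)

n_nodes = 1000  # arbitrary size, for illustration only

# Row based: one buffer holding every attribute of every node, padding included.
node = np.zeros(shape=n_nodes, dtype=node_dtype)

# Columnar: one plain array per attribute that is actually provided.
node_id = np.arange(1, n_nodes + 1, dtype=node_dtype["id"])
node_u_rated = np.full(shape=n_nodes, fill_value=10e3, dtype=node_dtype["u_rated"])

row_bytes = node.nbytes
columnar_bytes = node_id.nbytes + node_u_rated.nbytes

print(row_bytes, columnar_bytes)
# The per-attribute arrays are plain arrays without named fields.
print(node_id.dtype.names)
```

For a node the difference comes only from padding. For components with many attributes, leaving out the unused columns is where most of the saving would come from.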

## Creating Dataset

We then store the data in a dictionary.
Together with the data of other component types, this dictionary is a valid input dataset for the constructor of `PowerGridModel`,
see [Python API Reference](../api_reference/python-api-reference.md).

For a row based data format,

```python
input_data = {'node': node}
```

or for columnar data format,

```python
input_data_columnar = {'node': {"id": node_id, "u_rated": node_u_rated}}
```

There can also be a combination of both row based and columnar data format in a dataset.
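
As an illustration of such a combination, the sketch below keeps `node` row based and provides a second component in columnar form. The `source` entry with its `id` and `node` attributes and their `int32` dtype, as well as the helper `is_columnar`, are assumptions made for this example; the dictionary is not meant to be a complete valid input dataset.

```python
import numpy as np

# Assumed definition of node_dtype (the real one is given earlier in the document).
node_dtype = np.dtype(
    {
        "names": ["id", "u_rated"],
        "formats": ["<i4", "<f8"],
        "offsets": [0, 8],
        "itemsize": 16,
    }
)

# Row based data for the nodes.
node = np.zeros(shape=2, dtype=node_dtype)
node["id"] = [1, 2]
node["u_rated"] = [150e3, 10e3]

# Columnar data for a second component (attribute names and dtypes are assumed).
source_id = np.array([3], dtype=np.int32)
source_node = np.array([1], dtype=np.int32)

mixed_input_data = {
    "node": node,  # row based: one structured array
    "source": {"id": source_id, "node": source_node},  # columnar: one array per attribute
}


def is_columnar(component_data) -> bool:
    """Hypothetical helper: columnar data is a dict of per-attribute arrays."""
    return isinstance(component_data, dict)


formats = {
    name: "columnar" if is_columnar(data) else "row based"
    for name, data in mixed_input_data.items()
}
print(formats)
```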

In the `ctypes` wrapper, the pointers to all the array data are retrieved and passed to the C++ code.
This is also true for the result dataset.
The memory block of the result dataset is allocated using `numpy`.
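
The pointer retrieval can be sketched with the `ctypes` attribute of a NumPy array. This is only an illustration of the mechanism, not the actual code of the wrapper; the `node_dtype` definition is assumed as before.

```python
import ctypes

import numpy as np

# Assumed definition of node_dtype (the real one is given earlier in the document).
node_dtype = np.dtype(
    {
        "names": ["id", "u_rated"],
        "formats": ["<i4", "<f8"],
        "offsets": [0, 8],
        "itemsize": 16,
    }
)

node = np.zeros(shape=2, dtype=node_dtype)
node_u_rated = np.array([150e3, 10e3], dtype=node_dtype["u_rated"])

# Row based: a single untyped pointer to the buffer of the whole component.
node_ptr = node.ctypes.data_as(ctypes.c_void_p)

# Columnar: one pointer per attribute array.
u_rated_ptr = node_u_rated.ctypes.data_as(ctypes.POINTER(ctypes.c_double))

# The pointer refers to the memory owned by numpy, so no data is copied.
print(u_rated_ptr[0])

# Writing through the pointer, as native code would do for a result dataset,
# is visible in the numpy array afterwards.
u_rated_ptr[1] = 20e3
print(node_u_rated[1])
```

Because the memory is owned by `numpy`, the arrays have to stay alive for as long as the native code uses the pointers.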
@@ -141,9 +177,14 @@ The code below creates an array which is compatible with transformer input datas
```python
from power_grid_model import ComponentType, DatasetType, power_grid_meta_data

transformer = np.empty(shape=5, dtype=power_grid_meta_data[DatasetType.input][ComponentType.transformer]['dtype'])
transformer_dtype = power_grid_meta_data[DatasetType.input][ComponentType.transformer].dtype
# Array for row based data
transformer = np.empty(shape=5, dtype=transformer_dtype)
# Array for columnar data
transformer_tap_pos = np.empty(shape=5, dtype=transformer_dtype["tap_pos"])

# direct string access is supported as well:
# transformer = np.empty(shape=5, dtype=power_grid_meta_data['input']['transformer']['dtype'])
# transformer = np.empty(shape=5, dtype=power_grid_meta_data[DatasetType.input][ComponentType.transformer].dtype)
```

Furthermore, there is an even more convenient function `initialize_array`
3 changes: 3 additions & 0 deletions docs/user_manual/serialization.md
@@ -94,6 +94,9 @@ A [`ComponentDataset`](#json-schema-component-dataset-object) is an array of [`C
- [`ComponentDataset`](#json-schema-component-dataset-object): `Array`
- [`ComponentData`](#json-schema-component-data-object): the data per single component.

**NOTE:** The actual deserialized data representation may be row based or columnar, depending on the `data_filter` provided at deserialization (see {py:func}`json_deserialize <power_grid_model.utils.json_deserialize>` for an example).
Regardless of whether the deserialized data representation is row based or columnar, the serialization format remains the same.
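
The idea that both in-memory representations map to one serialized form can be sketched with plain `numpy` and `json`. The helper `to_records` below is hypothetical and is not the serializer of the library; the output is a simplified list of per-component objects, not the full schema described in this document.

```python
import json

import numpy as np


def to_records(component_data):
    """Hypothetical helper: turn row based or columnar data into a list of dicts."""
    if isinstance(component_data, dict):
        # Columnar: a dict of per-attribute arrays.
        names = list(component_data)
    else:
        # Row based: a structured array with named fields.
        names = list(component_data.dtype.names)
    columns = [component_data[name] for name in names]
    return [
        {name: column[i].item() for name, column in zip(names, columns)}
        for i in range(len(columns[0]))
    ]


# Row based representation of two nodes.
node = np.zeros(shape=2, dtype=[("id", "<i4"), ("u_rated", "<f8")])
node["id"] = [1, 2]
node["u_rated"] = [150e3, 10e3]

# Columnar representation of the same two nodes.
node_columnar = {
    "id": np.array([1, 2], dtype=np.int32),
    "u_rated": np.array([150e3, 10e3], dtype=np.float64),
}

row_json = json.dumps(to_records(node))
columnar_json = json.dumps(to_records(node_columnar))
print(row_json == columnar_json)
```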

#### JSON schema component data object

A [`ComponentData`](#json-schema-component-data-object) object is either a [`HomogeneousComponentData`](#json-schema-homogeneous-component-data-object) object or an [`InhomogeneousComponentData`](#json-schema-inhomogeneous-component-data-object) object