more efficient columnar deserialization #716

Merged
merged 3 commits into main from feature/more-efficient-columnar-deserialization on Sep 9, 2024

Conversation

@mgovers (Member) commented Sep 9, 2024

Makes deserialization more efficient compared to #708 as follows:

  • for each component type in the data set
    • if neither a row-based buffer nor any attribute buffers are set for that component on the WritableDataset:
      • we skip the component entirely
      • because it is neither row_based(*) nor columnar(*, with_attribute_buffers=true), so there is nothing to do
    • otherwise, for each scenario
      • for each element of the current component type in that scenario
        • if a row-based buffer is set
          • parse the element as before
        • otherwise, peek at the current element in the serialized data
          • if it is map-like
            • enter the element and parse as before
          • otherwise, it is array-like.
            • if none of the provided attribute buffers are present in the header,
              • skip this element
            • otherwise, enter the element and parse as before
              • there is no more efficient way to do this, because msgpack provides no cheaper way to skip, e.g., 3 individual attributes within an element

This provides a more sustainable solution than the one proposed in #714.

NOTE: the check whether none of the provided attribute buffers are present in the header is done while reordering the attribute buffers, and reduces to a simple check of whether the reordered list is empty. This is the best we can do, because there is no more efficient way to read only a subset of a msgpack array while skipping the rest (see the sketch below).
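
For reference, below is a minimal, self-contained sketch of the per-element decision described above, assuming the peek result is already known. The names ComponentBuffers, ElementEncoding, and must_parse are hypothetical and do not exist in the actual deserializer; they only model the control flow.

// Hypothetical model of the per-element dispatch described in this PR; not actual power-grid-model code.
#include <iostream>
#include <vector>

namespace {
enum class ElementEncoding { map_like, array_like }; // what peeking at the msgpack element would tell us

struct ComponentBuffers {
    bool has_row_buffer{};                 // a row-based buffer was set for this component
    std::vector<int> reordered_attributes; // columnar attribute buffers, reordered to match the header
};

// returns true if the element must be entered and parsed, false if it can be skipped as a whole
bool must_parse(ComponentBuffers const& buffers, ElementEncoding encoding) {
    if (buffers.has_row_buffer) {
        return true; // row-based: parse the element as before
    }
    if (encoding == ElementEncoding::map_like) {
        return true; // map-like: enter the element and parse as before
    }
    // array-like: after reordering, an empty list means none of the provided
    // attribute buffers appear in the header, so the whole element can be skipped
    return !buffers.reordered_attributes.empty();
}
} // namespace

int main() {
    ComponentBuffers columnar_no_match{.has_row_buffer = false, .reordered_attributes = {}};
    ComponentBuffers columnar_match{.has_row_buffer = false, .reordered_attributes = {0, 2}};

    std::cout << must_parse(columnar_no_match, ElementEncoding::array_like) << '\n'; // 0: skip
    std::cout << must_parse(columnar_match, ElementEncoding::array_like) << '\n';    // 1: parse
    return 0;
}

In the actual deserializer, the reordered attribute list is built once per component, so the per-element check reduces to the emptiness test above; elements that must be parsed are still entered attribute by attribute, because msgpack does not allow skipping a subset of an array's items more cheaply.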

@mgovers mgovers self-assigned this Sep 9, 2024
@mgovers mgovers added the improvement Improvement on internal implementation label Sep 9, 2024
@mgovers (Member, Author) commented Sep 9, 2024

test script:

// SPDX-FileCopyrightText: Contributors to the Power Grid Model project <powergridmodel@lfenergy.org>
//
// SPDX-License-Identifier: MPL-2.0

#include <power_grid_model/auxiliary/input.hpp>
#include <power_grid_model/auxiliary/meta_data_gen.hpp>
#include <power_grid_model/auxiliary/serialization/deserializer.hpp>
#include <power_grid_model/auxiliary/update.hpp>

#include <filesystem>
#include <fstream>

namespace {
using namespace power_grid_model;
using namespace power_grid_model::meta_data;
} // namespace

int main() {
    std::vector<char> serialized_data = [] {
        using namespace std::string_view_literals;
        constexpr auto file_path = "<msgpack_data>";
        std::vector<char> result(std::filesystem::file_size(file_path));
        std::ifstream f{file_path, std::ios::binary};
        f.read(result.data(), result.size());
        return result;
    }();

    auto deserializer = Deserializer{from_msgpack, serialized_data, meta_data_gen::meta_data};
    auto& dataset = deserializer.get_dataset_info();
    auto const& info = dataset.get_description();

    std::vector<std::vector<std::byte>> row_buffers{};
    std::vector<std::vector<std::vector<std::byte>>> column_buffers{};

    auto const n_components = meta_data_gen::meta_data.get_dataset("update").n_components();
    for (Idx idx = 0; idx < n_components; ++idx) {
        auto const& meta_component = *info.component_info[idx].component;
        auto const buffer_size = info.component_info[idx].total_elements * meta_component.size;
        if (idx < n_components / 4) {
            row_buffers.push_back(std::vector<std::byte>(buffer_size));
            dataset.set_buffer(meta_component.name, nullptr, row_buffers.back().data());
        } else if (idx < n_components / 2) {
            auto& buffer = column_buffers.emplace_back();

            for (auto const& meta_attribute : meta_component.attributes) {
                buffer.push_back(std::vector<std::byte>(buffer_size));
                dataset.add_attribute_buffer(meta_component.name, meta_attribute.name, buffer.back().data());
            }
        }
    }
    deserializer.parse();
    return 0;
}

@mgovers mgovers requested a review from figueroa1395 September 9, 2024 10:38
@TonyXiang8787 (Member) commented:
This is a nice improvement. But I still need to understand why parse_skip is so slow compared to parse_bool or parse_attribute. This needs attention before we just bypass the issue.


sonarqubecloud bot commented Sep 9, 2024

@TonyXiang8787 TonyXiang8787 added this pull request to the merge queue Sep 9, 2024
@mgovers (Member, Author) commented Sep 9, 2024

> This is a nice improvement. But I still need to understand why parse_skip is so slow compared to parse_bool or parse_attribute. This needs attention before we just bypass the issue.

Per offline discussions and benchmarking, parse_skip is not slower than parse_bool or parse_attribute. Hence, this PR is ready for merge.

Merged via the queue into main with commit 7ffcca3 Sep 9, 2024
26 checks passed
@TonyXiang8787 TonyXiang8787 deleted the feature/more-efficient-columnar-deserialization branch September 9, 2024 13:16
@mgovers mgovers mentioned this pull request Nov 5, 2024