more efficient columnar deserialization #716

Merged
merged 3 commits into main from feature/more-efficient-columnar-deserialization on Sep 9, 2024

Conversation

@mgovers (Member) commented Sep 9, 2024

Makes deserialization more efficient compared to #708 as follows:

  • for each component type in the data set
    • if neither a row-based buffer nor any attribute buffers are set for that component on the WritableDataset:
      • we skip the component entirely
      • because it is neither row_based(*) nor columnar(*, with_attribute_buffers=true), so there is nothing to do
    • otherwise, for each scenario
      • for each element of the current component type in that scenario
        • if a row-based buffer is set
          • parse the element as before
        • otherwise, peek at the current element in the serialized data
          • if it is map-like
            • enter the element and parse as before
          • otherwise, it is array-like.
            • if none of the provided attribute buffers are present in the header,
              • skip this element
            • otherwise, enter the element and parse as before
              • there is no more efficient way to do this, because msgpack provides no cheaper way to skip, e.g., 3 individual attributes within an element

This provides a more sustainable solution than the one proposed in #714.

NOTE: the check whether none of the provided attribute buffers are present in the header is done while reordering the attribute buffers, and reduces to a simple check of whether the reordered list is empty. This is the best we can do, because there is no more efficient way to read only a subset of a msgpack array while skipping the rest (see the sketch below).
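
For reference, below is a minimal, self-contained sketch of the per-element decision described above, assuming the peek result is already known. The names ComponentBuffers, ElementEncoding, and must_parse are hypothetical and do not exist in the actual deserializer; they only model the control flow.

// Hypothetical model of the per-element dispatch described in this PR; not actual power-grid-model code.
#include <iostream>
#include <vector>

namespace {
enum class ElementEncoding { map_like, array_like }; // what peeking at the msgpack element would tell us

struct ComponentBuffers {
    bool has_row_buffer{};                 // a row-based buffer was set for this component
    std::vector<int> reordered_attributes; // columnar attribute buffers, reordered to match the header
};

// returns true if the element must be entered and parsed, false if it can be skipped as a whole
bool must_parse(ComponentBuffers const& buffers, ElementEncoding encoding) {
    if (buffers.has_row_buffer) {
        return true; // row-based: parse the element as before
    }
    if (encoding == ElementEncoding::map_like) {
        return true; // map-like: enter the element and parse as before
    }
    // array-like: after reordering, an empty list means none of the provided
    // attribute buffers appear in the header, so the whole element can be skipped
    return !buffers.reordered_attributes.empty();
}
} // namespace

int main() {
    ComponentBuffers columnar_no_match{.has_row_buffer = false, .reordered_attributes = {}};
    ComponentBuffers columnar_match{.has_row_buffer = false, .reordered_attributes = {0, 2}};

    std::cout << must_parse(columnar_no_match, ElementEncoding::array_like) << '\n'; // 0: skip
    std::cout << must_parse(columnar_match, ElementEncoding::array_like) << '\n';    // 1: parse
    return 0;
}

In the actual deserializer, the reordered attribute list is built once per component, so the per-element check reduces to the emptiness test above; elements that must be parsed are still entered attribute by attribute, because msgpack does not allow skipping a subset of an array's items more cheaply.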

@mgovers mgovers self-assigned this Sep 9, 2024
@mgovers mgovers added the improvement Improvement on internal implementation label Sep 9, 2024
@mgovers (Member, Author) commented Sep 9, 2024

test script:

// SPDX-FileCopyrightText: Contributors to the Power Grid Model project <powergridmodel@lfenergy.org>
//
// SPDX-License-Identifier: MPL-2.0

#include <power_grid_model/auxiliary/input.hpp>
#include <power_grid_model/auxiliary/meta_data_gen.hpp>
#include <power_grid_model/auxiliary/serialization/deserializer.hpp>
#include <power_grid_model/auxiliary/update.hpp>

#include <filesystem>
#include <fstream>

namespace {
using namespace power_grid_model;
using namespace power_grid_model::meta_data;
} // namespace

int main() {
    std::vector<char> serialized_data = [] {
        using namespace std::string_view_literals;
        constexpr auto file_path = "<msgpack_data>";
        std::vector<char> result(std::filesystem::file_size(file_path));
        std::ifstream f{file_path, std::ios::binary};
        f.read(result.data(), result.size());
        return result;
    }();

    auto deserializer = Deserializer{from_msgpack, serialized_data, meta_data_gen::meta_data};
    auto& dataset = deserializer.get_dataset_info();
    auto const& info = dataset.get_description();

    std::vector<std::vector<std::byte>> row_buffers{};
    std::vector<std::vector<std::vector<std::byte>>> column_buffers{};

    auto const n_components = meta_data_gen::meta_data.get_dataset("update").n_components();
    for (Idx idx = 0; idx < n_components; ++idx) {
        auto const& meta_component = *info.component_info[idx].component;
        auto const buffer_size = info.component_info[idx].total_elements * meta_component.size;
        if (idx < n_components / 4) {
            row_buffers.push_back(std::vector<std::byte>(buffer_size));
            dataset.set_buffer(meta_component.name, nullptr, row_buffers.back().data());
        } else if (idx < n_components / 2) {
            auto& buffer = column_buffers.emplace_back();

            for (auto const& meta_attribute : meta_component.attributes) {
                buffer.push_back(std::vector<std::byte>(buffer_size));
                dataset.add_attribute_buffer(meta_component.name, meta_attribute.name, buffer.back().data());
            }
        }
    }
    deserializer.parse();
    return 0;
}

@mgovers mgovers requested a review from figueroa1395 September 9, 2024 10:38
@TonyXiang8787 (Member) commented:
This is a nice improvement. But I still need to understand why parse_skip is so slow compared to parse_bool or parse_attribute. This needs attention before we just bypass the issue.


sonarqubecloud bot commented Sep 9, 2024

@TonyXiang8787 TonyXiang8787 added this pull request to the merge queue Sep 9, 2024
@mgovers (Member, Author) commented Sep 9, 2024

> This is a nice improvement. But I still need to understand why parse_skip is so slow compared to parse_bool or parse_attribute. This needs attention before we just bypass the issue.

Per offline discussions and benchmarking, parse_skip is not slower than parse_bool or parse_attribute. Hence, this PR is ready for merge.

Merged via the queue into main with commit 7ffcca3 Sep 9, 2024
26 checks passed
@TonyXiang8787 TonyXiang8787 deleted the feature/more-efficient-columnar-deserialization branch September 9, 2024 13:16
@mgovers mgovers mentioned this pull request Nov 5, 2024