
Feature/columnar data serialization #708

Merged
merged 31 commits into main from feature/columnar-data-serialization on Sep 5, 2024

Conversation

mgovers
Member

@mgovers mgovers commented Aug 30, 2024

Steps 5 & 6 of Implementation #548

@mgovers mgovers added the feature New feature or request label Aug 30, 2024
@mgovers mgovers self-assigned this Aug 30, 2024
@mgovers mgovers force-pushed the feature/columnar-data-serialization branch from 81011bd to 6ab07b4 Compare August 30, 2024 09:25
@mgovers mgovers force-pushed the feature/columnar-data-serialization branch from 6ab07b4 to c7e47b2 Compare August 30, 2024 09:26
@mgovers mgovers removed the request for review from TonyXiang8787 August 30, 2024 09:34
mgovers and others added 5 commits August 30, 2024 11:53
@TonyXiang8787
Member

The deserialization part is looking good now.

@mgovers mgovers marked this pull request as ready for review September 5, 2024 12:50

sonarqubecloud bot commented Sep 5, 2024

@TonyXiang8787 TonyXiang8787 added this pull request to the merge queue Sep 5, 2024
@TonyXiang8787
Member

@mgovers PR is now in the merge queue. After the merge, maybe you can run a serialization benchmark (using your old dataset) to compare the old and new versions of row-based (de-)serialization performance. There should not be a significant change.

Merged via the queue into main with commit fdb58ef Sep 5, 2024
26 checks passed
@TonyXiang8787 TonyXiang8787 deleted the feature/columnar-data-serialization branch September 5, 2024 15:02
@mgovers
Member Author

mgovers commented Sep 6, 2024

Bad news: the parsing step in the deserializer takes excessively long now.

Update dataset of shape (800, 800) for all component types, with random values for all attributes. Notice that the total deserialization time is incredibly long, but that is mainly related to the Python string handling and maybe something we need to have a look at.

v1.9.35

| data | serializer creation | serialization | serialization total | deserializer creation + pre-parse | deserializer parse | deserializer total |
|---|---|---|---|---|---|---|
| msgpack [compact] | 0.0005251000402495265 s | 2.2746466000098735 s | 2.8601861000061035 s | 0.33473320002667606 s | 55.92428519995883 s | 56.386925800004974 s |
| msgpack [regular] | 0.0005136000690981746 s | 2.6831328000407666 s | 3.5524667999707162 s | 0.5777812999440357 s | 99.07666269992478 s | 99.91340609989129 s |
| json [compact] | 0.0007071000291034579 s | 29.271804499905556 s | 60.09442940005101 s | 12.122917199973017 s | 56.96022929996252 s | 70.54345009999815 s |
| json [regular] | 0.0004412999842315912 s | 36.51309969997965 s | 76.11392769997474 s | 15.74001260008663 s | 100.83267979999073 s | 119.05708439997397 s |

v1.9.34

| data | serializer creation | serialization | serialization total | deserializer creation + pre-parse | deserializer parse | deserializer total |
|---|---|---|---|---|---|---|
| msgpack [compact] | 0.000408099964261055 s | 2.672189600067213 s | 3.146100400015712 s | 0.2728536000940949 s | 2.623856300022453 s | 3.0367689999984577 s |
| msgpack [regular] | 0.000543000060133636 s | 2.9165597000392154 s | 3.8790119000477716 s | 0.5125961999874562 s | 5.055858200066723 s | 5.831057699979283 s |
| json [compact] | 0.0004981999518349767 s | 29.134515200043097 s | 60.77616540004965 s | 12.264068600023165 s | 2.503174800076522 s | 16.27681280008983 s |
| json [regular] | 0.0005403000395745039 s | 35.86926089995541 s | 75.49933109991252 s | 15.36376619990915 s | 4.866490800050087 s | 22.552572900080122 s |

file sizes

| data | size |
|---|---|
| msgpack [compact] | 356'456 KB |
| msgpack [regular] | 744'580 KB |
| json [compact] | 902'956 KB |
| json [regular] | 1'431'704 KB |

I will follow up on this.

@TonyXiang8787
Member

TonyXiang8787 commented Sep 6, 2024

@mgovers please format the table for v1.9.35

@mgovers
Member Author

mgovers commented Sep 6, 2024

benchmark script

# SPDX-FileCopyrightText: Contributors to the Power Grid Model project <powergridmodel@lfenergy.org>
#
# SPDX-License-Identifier: MPL-2.0

from pathlib import Path
from timeit import timeit
from typing import Callable
import numpy as np

from power_grid_model import initialize_array
from power_grid_model.core.dataset_definitions import ComponentType, DatasetType
from power_grid_model.utils import (
    json_deserialize_from_file,
    json_serialize_to_file,
    msgpack_deserialize_from_file,
    msgpack_serialize_to_file,
)


# def generate(shape):
#     data = {}
#     for comp in ComponentType:
#         arr = initialize_array(DatasetType.update, comp, shape=shape, empty=True)
#         for attribute in arr.dtype.names:
#             arr[attribute] = np.random.rand(*arr[attribute].shape)
#         data[comp] = arr

#     print("done generating")
#     return data


def serialize(func: Callable, file_path: Path, data, **kwargs):
    def execute():
        print(f"serializing: {file_path}")
        func(file_path, data, **kwargs)
        print(f"done serializing: {file_path}")

    print(f"timed {func.__name__}({file_path}): {timeit(execute, number=1)} s")


def deserialize(func: Callable, file_path: Path):
    def execute():
        print(f"deserializing: {file_path}")
        func(file_path)
        print(f"done deserializing: {file_path}")
    
    print(f"timed {func.__name__}({file_path}): {timeit(execute, number=1)} s")


if __name__ == "__main__":
    shape = (800, 800)
    # update_data = generate(shape)
    update_data = msgpack_deserialize_from_file(Path("tmp/data/base_data.msgpack"))

    serialize(msgpack_serialize_to_file, Path("tmp/data/all_random_compact.msgpack"), update_data, use_compact_list=True)
    deserialize(msgpack_deserialize_from_file, Path("tmp/data/all_random_compact.msgpack"))
    serialize(msgpack_serialize_to_file, Path("tmp/data/all_random.msgpack"), update_data)
    deserialize(msgpack_deserialize_from_file, Path("tmp/data/all_random.msgpack"))
    serialize(json_serialize_to_file, Path("tmp/data/all_random_compact.json"), update_data, use_compact_list=True)
    deserialize(json_deserialize_from_file, Path("tmp/data/all_random_compact.json"))
    serialize(json_serialize_to_file, Path("tmp/data/all_random.json"), update_data)
    deserialize(json_deserialize_from_file, Path("tmp/data/all_random.json"))

test script

// SPDX-FileCopyrightText: Contributors to the Power Grid Model project <powergridmodel@lfenergy.org>
//
// SPDX-License-Identifier: MPL-2.0

#include <power_grid_model/auxiliary/input.hpp>
#include <power_grid_model/auxiliary/meta_data_gen.hpp>
#include <power_grid_model/auxiliary/serialization/deserializer.hpp>
#include <power_grid_model/auxiliary/update.hpp>

#include <fstream>
#include <sstream>

namespace {
using namespace power_grid_model;
using namespace power_grid_model::meta_data;
} // namespace

int main() {
    std::string str = [] {
        std::ifstream f{"<some_file>.json"};
        std::stringstream sstr;
        sstr << f.rdbuf();
        return sstr.str();
    }();

    auto deserializer = Deserializer{from_json, str, meta_data_gen::meta_data};
    auto dataset = deserializer.get_dataset_info();
    auto info = dataset.get_description();
    std::vector<std::vector<char>> buffers{};
    for (Idx idx = 0; idx < meta_data_gen::meta_data.get_dataset("update").n_components(); ++idx) {
        buffers.push_back(
            std::vector<char>(info.component_info[idx].total_elements * info.component_info[idx].component->size));
        dataset.set_buffer(info.component_info[idx].component->name, nullptr, buffers[idx].data());
    }
    deserializer.parse();
    return 0;
}

@TonyXiang8787
Member

@mgovers there is a major overhead (>90%) in the new de-serializer. We need to dump the data into a file and write a C++ executable to run on the data. Then we can profile the overhead.

@mgovers
Member Author

mgovers commented Sep 6, 2024

I found out what's wrong. Attempting to fix now.

@mgovers
Member Author

mgovers commented Sep 6, 2024

The same test script, but using msgpack:

// SPDX-FileCopyrightText: Contributors to the Power Grid Model project <powergridmodel@lfenergy.org>
//
// SPDX-License-Identifier: MPL-2.0

#include <power_grid_model/auxiliary/input.hpp>
#include <power_grid_model/auxiliary/meta_data_gen.hpp>
#include <power_grid_model/auxiliary/serialization/deserializer.hpp>
#include <power_grid_model/auxiliary/update.hpp>

#include <filesystem>
#include <fstream>
#include <sstream>

namespace {
using namespace power_grid_model;
using namespace power_grid_model::meta_data;
} // namespace

int main() {
    std::vector<char> result = [] {
        using namespace std::string_view_literals;
        constexpr auto file_path = "<msgpack_file>";
        std::vector<char> result(std::filesystem::file_size(file_path));
        std::ifstream f{file_path, std::ios::binary};
        f.read(result.data(), result.size());
        return result;
    }();

    auto deserializer = Deserializer{from_msgpack, result, meta_data_gen::meta_data};
    auto dataset = deserializer.get_dataset_info();
    auto info = dataset.get_description();
    std::vector<std::vector<char>> buffers{};
    for (Idx idx = 0; idx < meta_data_gen::meta_data.get_dataset("update").n_components(); ++idx) {
        buffers.push_back(
            std::vector<char>(info.component_info[idx].total_elements * info.component_info[idx].component->size));
        dataset.set_buffer(info.component_info[idx].component->name, nullptr, buffers[idx].data());
    }
    deserializer.parse();
    return 0;
}

Comment on lines +706 to +708
} else {
    parse_component(columnar, component_idx);
}
Member Author

@mgovers mgovers Sep 6, 2024


I found out that this call causes major overhead when no attributes are present. That means that the logic concerning empty attribute buffers is incorrect.

Suggested change
-} else {
-    parse_component(columnar, component_idx);
-}
+} else if (!dataset_handler_.get_buffer(component_idx).attributes.empty()) {
+    parse_component(columnar, component_idx);
+}

This is not a sustainable fix (it points at the fact that there is some other problem), but it will resolve the issue in the short term.

| data | serializer creation | serialization | serialization total | deserializer creation + pre-parse | deserializer parse | deserializer total |
|---|---|---|---|---|---|---|
| msgpack [compact] | 0.0004923000233247876 s | 2.3669713999843225 s | 3.0552195999771357 s | 0.28873490006662905 s | 2.858401000034064 s | 3.285264000063762 s |
| msgpack [regular] | 0.0007754000835120678 s | 2.9432286999654025 s | 3.9057689999463037 s | 0.5173835000023246 s | 5.383461399935186 s | 6.166236900025979 s |
| json [compact] | 0.00043050001841038465 s | 29.926429499988444 s | 63.13568380009383 s | 14.445208700024523 s | 3.2074557000305504 s | 19.59107560000848 s |
| json [regular] | 0.0008066999725997448 s | 37.490393099957146 s | 78.83358660002705 s | 18.796793300076388 s | 5.285955099971034 s | 26.85202120000031 s |

Member

What has changed in your suggestion to avoid the overhead?

Member

Your test scripts are testing row-based de-serialization, right? So the else or else if in this segment is never reached.

Member Author

Turns out it is... there's definitely an issue. I have very little time left before I need to go, though. Should we roll back?

Member Author

OK, found the problem in my test script (`auto` where there should have been `auto&` when declaring the dataset).
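
For reference, a minimal sketch of the difference (assuming `get_dataset_info()` returns a reference to the deserializer's internal writable dataset, as used in the test scripts above):

// buggy: `auto` deduces a value type, so `dataset` is a copy of the dataset info;
// the subsequent set_buffer(...) calls only register buffers on that copy
auto dataset = deserializer.get_dataset_info();

// fixed: bind a reference, so set_buffer(...) registers the buffers on the dataset
// that deserializer.parse() actually uses
auto& dataset = deserializer.get_dataset_info();

With the copy, the deserializer never sees any of the buffers, which matches the observation below that all components end up treated as column-based with no attributes.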

Member

I see. So the buffers are never set in the de-serializer, and all the components are treated as column-based with no attributes.
