
Feature/columnar data serialization #708

Merged
merged 31 commits into main from feature/columnar-data-serialization on Sep 5, 2024

Conversation

mgovers
Member

@mgovers mgovers commented Aug 30, 2024

Steps 5 & 6 of Implementation #548

@mgovers mgovers added the feature New feature or request label Aug 30, 2024
@mgovers mgovers self-assigned this Aug 30, 2024
@mgovers mgovers force-pushed the feature/columnar-data-serialization branch from 81011bd to 6ab07b4 Compare August 30, 2024 09:25
@mgovers mgovers force-pushed the feature/columnar-data-serialization branch from 6ab07b4 to c7e47b2 Compare August 30, 2024 09:26
@mgovers mgovers removed the request for review from TonyXiang8787 August 30, 2024 09:34
mgovers and others added 5 commits August 30, 2024 11:53
@TonyXiang8787
Member

The deserialization part is looking good now.

@mgovers mgovers marked this pull request as ready for review September 5, 2024 12:50

sonarqubecloud bot commented Sep 5, 2024

@TonyXiang8787 TonyXiang8787 added this pull request to the merge queue Sep 5, 2024
@TonyXiang8787
Member

@mgovers PR is now in the merge queue. After the merge, maybe you can run a serialization benchmark (using your old dataset) to compare the old and new versions of row-based (de-)serialization performance. There should not be a significant change.

Merged via the queue into main with commit fdb58ef Sep 5, 2024
26 checks passed
@TonyXiang8787 TonyXiang8787 deleted the feature/columnar-data-serialization branch September 5, 2024 15:02
@mgovers
Member Author

mgovers commented Sep 6, 2024

Bad news: the parsing step in the deserializer takes excessively long now.

Update dataset of shape (800, 800) for all component types, with random values for all attributes. Notice that the total deserialization time is incredibly long, but that is mainly related to the Python string handling and maybe something we need to have a look at.

v1.9.35

| data | serializer creation | serialization | serialization total | deserializer creation + pre-parse | deserializer parse | deserializer total |
|---|---|---|---|---|---|---|
| msgpack [compact] | 0.0005251000402495265 s | 2.2746466000098735 s | 2.8601861000061035 s | 0.33473320002667606 s | 55.92428519995883 s | 56.386925800004974 s |
| msgpack [regular] | 0.0005136000690981746 s | 2.6831328000407666 s | 3.5524667999707162 s | 0.5777812999440357 s | 99.07666269992478 s | 99.91340609989129 s |
| json [compact] | 0.0007071000291034579 s | 29.271804499905556 s | 60.09442940005101 s | 12.122917199973017 s | 56.96022929996252 s | 70.54345009999815 s |
| json [regular] | 0.0004412999842315912 s | 36.51309969997965 s | 76.11392769997474 s | 15.74001260008663 s | 100.83267979999073 s | 119.05708439997397 s |

v1.9.34

| data | serializer creation | serialization | serialization total | deserializer creation + pre-parse | deserializer parse | deserializer total |
|---|---|---|---|---|---|---|
| msgpack [compact] | 0.000408099964261055 s | 2.672189600067213 s | 3.146100400015712 s | 0.2728536000940949 s | 2.623856300022453 s | 3.0367689999984577 s |
| msgpack [regular] | 0.000543000060133636 s | 2.9165597000392154 s | 3.8790119000477716 s | 0.5125961999874562 s | 5.055858200066723 s | 5.831057699979283 s |
| json [compact] | 0.0004981999518349767 s | 29.134515200043097 s | 60.77616540004965 s | 12.264068600023165 s | 2.503174800076522 s | 16.27681280008983 s |
| json [regular] | 0.0005403000395745039 s | 35.86926089995541 s | 75.49933109991252 s | 15.36376619990915 s | 4.866490800050087 s | 22.552572900080122 s |

file sizes

| data | size |
|---|---|
| msgpack [compact] | 356'456 KB |
| msgpack [regular] | 744'580 KB |
| json [compact] | 902'956 KB |
| json [regular] | 1'431'704 KB |

I will follow up on this.

@TonyXiang8787
Member

TonyXiang8787 commented Sep 6, 2024

@mgovers please format the table for v1.9.35

@mgovers
Member Author

mgovers commented Sep 6, 2024

benchmark script

# SPDX-FileCopyrightText: Contributors to the Power Grid Model project <powergridmodel@lfenergy.org>
#
# SPDX-License-Identifier: MPL-2.0

from pathlib import Path
from timeit import timeit
from typing import Callable
import numpy as np

from power_grid_model import initialize_array
from power_grid_model.core.dataset_definitions import ComponentType, DatasetType
from power_grid_model.utils import (
    json_deserialize_from_file,
    json_serialize_to_file,
    msgpack_deserialize_from_file,
    msgpack_serialize_to_file,
)


# def generate(shape):
#     data = {}
#     for comp in ComponentType:
#         arr = initialize_array(DatasetType.update, comp, shape=shape, empty=True)
#         for attribute in arr.dtype.names:
#             arr[attribute] = np.random.rand(*arr[attribute].shape)
#         data[comp] = arr

#     print("done generating")
#     return data


def serialize(func: Callable, file_path: Path, data, **kwargs):
    def execute():
        print(f"serializing: {file_path}")
        func(file_path, data, **kwargs)
        print(f"done serializing: {file_path}")

    print(f"timed {func.__name__}({file_path}): {timeit(execute, number=1)} s")


def deserialize(func: Callable, file_path: Path):
    def execute():
        print(f"deserializing: {file_path}")
        func(file_path)
        print(f"done deserializing: {file_path}")
    
    print(f"timed {func.__name__}({file_path}): {timeit(execute, number=1)} s")


if __name__ == "__main__":
    shape = (800, 800)
    # update_data = generate(shape)
    update_data = msgpack_deserialize_from_file(Path("tmp/data/base_data.msgpack"))

    serialize(msgpack_serialize_to_file, Path("tmp/data/all_random_compact.msgpack"), update_data, use_compact_list=True)
    deserialize(msgpack_deserialize_from_file, Path("tmp/data/all_random_compact.msgpack"))
    serialize(msgpack_serialize_to_file, Path("tmp/data/all_random.msgpack"), update_data)
    deserialize(msgpack_deserialize_from_file, Path("tmp/data/all_random.msgpack"))
    serialize(json_serialize_to_file, Path("tmp/data/all_random_compact.json"), update_data, use_compact_list=True)
    deserialize(json_deserialize_from_file, Path("tmp/data/all_random_compact.json"))
    serialize(json_serialize_to_file, Path("tmp/data/all_random.json"), update_data)
    deserialize(json_deserialize_from_file, Path("tmp/data/all_random.json"))

test script

// SPDX-FileCopyrightText: Contributors to the Power Grid Model project <powergridmodel@lfenergy.org>
//
// SPDX-License-Identifier: MPL-2.0

#include <power_grid_model/auxiliary/input.hpp>
#include <power_grid_model/auxiliary/meta_data_gen.hpp>
#include <power_grid_model/auxiliary/serialization/deserializer.hpp>
#include <power_grid_model/auxiliary/update.hpp>

#include <fstream>
#include <sstream>

namespace {
using namespace power_grid_model;
using namespace power_grid_model::meta_data;
} // namespace

int main() {
    std::string str = [] {
        std::ifstream f{"<some_file>.json"};
        std::stringstream sstr;
        sstr << f.rdbuf();
        return sstr.str();
    }();

    auto deserializer = Deserializer{from_json, str, meta_data_gen::meta_data};
    auto dataset = deserializer.get_dataset_info();
    auto info = dataset.get_description();
    std::vector<std::vector<char>> buffers{};
    for (Idx idx = 0; idx < meta_data_gen::meta_data.get_dataset("update").n_components(); ++idx) {
        buffers.push_back(
            std::vector<char>(info.component_info[idx].total_elements * info.component_info[idx].component->size));
        dataset.set_buffer(info.component_info[idx].component->name, nullptr, buffers[idx].data());
    }
    deserializer.parse();
    return 0;
}

@TonyXiang8787
Member

@mgovers there is a major overhead (>90%) in the new de-serializer. We need to dump the data into a file and write a C++ executable to run on the data. Then we can profile the overhead.

@mgovers
Member Author

mgovers commented Sep 6, 2024

I found out what's wrong. Attempting to fix now.

@mgovers
Member Author

mgovers commented Sep 6, 2024

The same test script, but using msgpack:

// SPDX-FileCopyrightText: Contributors to the Power Grid Model project <powergridmodel@lfenergy.org>
//
// SPDX-License-Identifier: MPL-2.0

#include <power_grid_model/auxiliary/input.hpp>
#include <power_grid_model/auxiliary/meta_data_gen.hpp>
#include <power_grid_model/auxiliary/serialization/deserializer.hpp>
#include <power_grid_model/auxiliary/update.hpp>

#include <filesystem>
#include <fstream>
#include <sstream>

namespace {
using namespace power_grid_model;
using namespace power_grid_model::meta_data;
} // namespace

int main() {
    std::vector<char> result = [] {
        using namespace std::string_view_literals;
        constexpr auto file_path = "<msgpack_file>";
        std::vector<char> result(std::filesystem::file_size(file_path));
        std::ifstream f{file_path, std::ios::binary};
        f.read(result.data(), result.size());
        return result;
    }();

    auto deserializer = Deserializer{from_msgpack, result, meta_data_gen::meta_data};
    auto dataset = deserializer.get_dataset_info();
    auto info = dataset.get_description();
    std::vector<std::vector<char>> buffers{};
    for (Idx idx = 0; idx < meta_data_gen::meta_data.get_dataset("update").n_components(); ++idx) {
        buffers.push_back(
            std::vector<char>(info.component_info[idx].total_elements * info.component_info[idx].component->size));
        dataset.set_buffer(info.component_info[idx].component->name, nullptr, buffers[idx].data());
    }
    deserializer.parse();
    return 0;
}

Comment on lines +706 to +708
} else {
    parse_component(columnar, component_idx);
}
Member Author

@mgovers mgovers Sep 6, 2024


I found out that this call causes major overhead when no attributes are present. That means that the logic concerning empty attribute buffers is incorrect.

Suggested change
-} else {
-    parse_component(columnar, component_idx);
-}
+} else if (!dataset_handler_.get_buffer(component_idx).attributes.empty()) {
+    parse_component(columnar, component_idx);
+}

This is not a sustainable fix (it points at the fact that there is some other problem), but it will resolve the issue in the short term.

| data | serializer creation | serialization | serialization total | deserializer creation + pre-parse | deserializer parse | deserializer total |
|---|---|---|---|---|---|---|
| msgpack [compact] | 0.0004923000233247876 s | 2.3669713999843225 s | 3.0552195999771357 s | 0.28873490006662905 s | 2.858401000034064 s | 3.285264000063762 s |
| msgpack [regular] | 0.0007754000835120678 s | 2.9432286999654025 s | 3.9057689999463037 s | 0.5173835000023246 s | 5.383461399935186 s | 6.166236900025979 s |
| json [compact] | 0.00043050001841038465 s | 29.926429499988444 s | 63.13568380009383 s | 14.445208700024523 s | 3.2074557000305504 s | 19.59107560000848 s |
| json [regular] | 0.0008066999725997448 s | 37.490393099957146 s | 78.83358660002705 s | 18.796793300076388 s | 5.285955099971034 s | 26.85202120000031 s |

Member

What has changed in your suggestion to avoid the overhead?

Member

Your test scripts are testing row-based de-serialization, right? So the else or else if in this segment is never reached.

Member Author

Turns out it is... there's definitely an issue. I have very little time left before I need to go, though. Should we roll back?

Member Author

OK, found the problem in my test script (`auto` where there should have been `auto&` when declaring the dataset).
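
For reference, a minimal sketch of the difference (assuming `get_dataset_info()` returns a reference to the deserializer's internal writable dataset, as used in the test scripts above):

// buggy: `auto` deduces a value type, so `dataset` is a copy of the dataset info;
// the subsequent set_buffer(...) calls only register buffers on that copy
auto dataset = deserializer.get_dataset_info();

// fixed: bind a reference, so set_buffer(...) registers the buffers on the dataset
// that deserializer.parse() actually uses
auto& dataset = deserializer.get_dataset_info();

With the copy, the deserializer never sees any of the buffers, which matches the observation below that all components end up treated as column-based with no attributes.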

Member

I see. So the buffers are never set in the de-serializer, and all the components are treated as column-based with no attributes.
