
ASV framework refactor proposal #2227

Draft: wants to merge 1 commit into base branch asv_s3_more
Conversation

IvoDD
Collaborator

@IvoDD IvoDD commented Mar 10, 2025

This commit adds a brief code example of how I think we can simplify the current framework.

The main idea is that currently the `EnvConfigurationBase` is responsible for too many things at the same time:

  • dataframe generation
  • library creation/population
  • differentiating between persistent/modifiable libraries
  • managing the different libraries needed for a benchmark

Being responsible for such a large bundle of things makes it hard to:

  • reuse it between different benchmarks (we basically end up with one new child of `EnvConfigurationBase` per benchmark)
  • understand what a given method actually does, so we have to rely on lengthy documentation, which is harder to keep up to date

My proposal is not code complete, but it shows how we might go about simplifying this framework. The new code is in `environment_setup_refactored.py` and shows how we might replace the current `environment_setup.py`.

I've also tweaked the `LMDBReadWrite` benchmark to show how we can use the new framework.

The proposal mainly consists of separating the four responsibilities of the `EnvConfigurationBase`:

  • dataframe generation (now we have an interface `DataFrameGenerator`)
  • library population (now we have an interface `LibraryPopulationPolicy`)
  • persistent/modifiable (now we just have a function which can give you whichever kind of library you need)
  • managing different libraries (now this is entirely the responsibility of the benchmark, as each benchmark might need completely different logic around the number/types of libraries it uses; e.g. read_write uses one persistent and one modifiable library)

By separating these we get better reusability:

  • e.g. we can mix and match population policies with dataframe generators across different benchmarks

Also, when I now look at the code in a benchmark I can see how the libraries are created/populated, instead of an opaque `setup` call whose behavior is unclear.

I have left various inline comments explaining why I think the new approach is better/clearer. Any comments welcome.
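To make the proposed split concrete, here is a minimal sketch of how the separated pieces could fit together. Only the names `LibraryType`, `DataFrameGenerator`, and `LibraryPopulationPolicy` come from the proposal itself; every method name and the concrete classes below are my own hypothetical placeholders, and a plain dict of columns stands in for a `pd.DataFrame` to keep the sketch dependency-free.

```python
from abc import ABC, abstractmethod
from enum import Enum


class LibraryType(Enum):
    PERSISTENT = "PERMANENT"
    MODIFIABLE = "MODIFIABLE"


class DataFrameGenerator(ABC):
    """Single responsibility: generate benchmark data.

    The real framework would return a pd.DataFrame; a dict of columns is
    used here so the sketch has no third-party dependencies.
    """

    @abstractmethod
    def get_dataframe(self, num_rows: int):
        ...


class FloatColumnGenerator(DataFrameGenerator):
    """Hypothetical generator producing a single float column."""

    def get_dataframe(self, num_rows: int):
        return {"value": [float(i) for i in range(num_rows)]}


class LibraryPopulationPolicy(ABC):
    """Single responsibility: decide which symbols to write and with what data."""

    def __init__(self, df_generator: DataFrameGenerator):
        self.df_generator = df_generator

    @abstractmethod
    def populate(self, lib) -> None:
        ...


class SingleSymbolPopulation(LibraryPopulationPolicy):
    """Hypothetical policy writing one symbol of a given size."""

    def __init__(self, df_generator: DataFrameGenerator, num_rows: int):
        super().__init__(df_generator)
        self.num_rows = num_rows

    def populate(self, lib) -> None:
        # `lib` is whatever the library-getter function returned;
        # the benchmark decides which library type it wants.
        lib.write("sym", self.df_generator.get_dataframe(self.num_rows))
```

With this shape, a benchmark wires a generator to a policy and applies the policy to whichever libraries it manages, which is the "mix and match" reuse described above.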

Collaborator

I think that both versions are ok.

The main goal of the original version was to separate the creation structure from the ASV class. That needed to be done because of the persistence library: only this way could the code for generating the environment be safely tested without making changes to the tests. You fix the logic for generating the structure of objects outside of the ASV tests, then you go to the ASV tests and use it wherever you want. So the class, although monolithic, has this main function. That makes the setup code a little harder to understand, but easy to reuse outside of ASV, as it does not follow ASV guidelines. But the benchmarks exist mainly for the tests, not for the setup structure, so once developed it would be unwise to change the structure, due to loss of history for the benchmarks.

The proposed version indeed makes things clearer at first glance during the review process, as the code will live primarily inside the ASV test. Having more decomposed objects would perhaps increase reusability. Still, there are things missing from the model, and the final solution will be thicker and perhaps again not quite clean here and there once the missing pieces are added.

Thus for me it is ok to move to the new way of constructing tests; I just have to say that as this approach is adapted to different benchmarks, its cleanness might suffer.


```python
class LibraryType(Enum):
    PERSISTENT = "PERMANENT"
    MODIFIABLE = "MODIFIABLE"
```
Collaborator

I would advise adding the concept of a TEST library type, which is a persistent space for tests. Setting it through a property of the generator class, e.g. set_test(), could automatically make the code execute in a separate space and not pollute the PERSISTENT space.
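One way the suggested TEST type could slot into the enum is sketched below. The `TEST` member and the `get_library_name` helper are my own hypothetical additions illustrating the idea of routing test runs to their own namespace; the two existing members are from the diff above.

```python
from enum import Enum


class LibraryType(Enum):
    PERSISTENT = "PERMANENT"
    MODIFIABLE = "MODIFIABLE"
    TEST = "TEST"  # hypothetical: persistent space reserved for tests


def get_library_name(lib_type: LibraryType, benchmark_name: str) -> str:
    # Prefixing library names with the type keeps TEST runs from
    # polluting the PERSISTENT space: they land under a separate prefix.
    return f"{lib_type.value}_{benchmark_name}"
```

For example, `get_library_name(LibraryType.TEST, "read_write")` yields `"TEST_read_write"`, while the persistent variant yields `"PERMANENT_read_write"`, so the two spaces never collide.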

```python
# Using a separate abstraction for the population policy can allow us to be flexible
# with how we populate libraries.
# E.g. one benchmark can use one population policy for read operations, a different
# one for updates, and a third for appends.
# Note: Using a kwargs generator to be passed to the df_generator allows customization
# of dataframe generation params (e.g. num_rows, num_cols, use_string_columns?)
class LibraryPopulationPolicy(ABC):
```
Collaborator

We discussed with Ivo that in the previous version the setup environment class was responsible for both library and symbol generation. In this version it will be responsible only for symbol population; the ASV test will then invoke it for each library it needs to populate.

Perhaps this could also be amended with other methods and attributes:

  • number of rows
  • number of columns
  • number of versions
  • number of snapshots
  • etc.

I would suggest that populate_library and populate_persistent_library_if_missing are added to this class as well.

I do not believe we will have only a single way to populate_library. And if we start with a single way, it will be the only way; what happens when we need more ways to do it?

Collaborator Author

I agree, as we discussed offline, that we need a more complex LibraryPopulationPolicy to accommodate many versions/many snapshots.

I'd be happy to move populate_library into the class, although I want to avoid the situation where we end up with a complex inheritance structure of population policies.

Probably just a single policy which allows many symbols / many versions / a snapshot frequency can serve most of our tests.
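A single configurable policy of the kind discussed above might look roughly like this. All names and parameters here are hypothetical; the point is that symbols, versions, and snapshot frequency become plain constructor arguments instead of an inheritance tree of policy subclasses.

```python
class GeneralPopulationPolicy:
    """Hypothetical single policy covering many symbols, many versions per
    symbol, and an optional snapshot frequency."""

    def __init__(self, num_symbols: int, versions_per_symbol: int = 1,
                 snapshot_every_n_versions: int = 0):
        self.num_symbols = num_symbols
        self.versions_per_symbol = versions_per_symbol
        # 0 means "never snapshot"
        self.snapshot_every_n_versions = snapshot_every_n_versions

    def populate(self, lib, df_generator):
        # `df_generator` is any callable producing the data for one version;
        # `lib` is assumed to expose write()/snapshot() like an Arctic library.
        for sym_idx in range(self.num_symbols):
            symbol = f"sym_{sym_idx}"
            for version in range(self.versions_per_symbol):
                lib.write(symbol, df_generator(version))
                if (self.snapshot_every_n_versions
                        and (version + 1) % self.snapshot_every_n_versions == 0):
                    lib.snapshot(f"snap_{symbol}_{version}")
```

For example, `GeneralPopulationPolicy(2, 4, 2)` would write 2 symbols with 4 versions each and snapshot after every second version, without any new subclass being defined.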


```python
def time_read_with_column_float(self, storage_info, num_rows):
    ac = get_arctic_client(LMDBReadWrite.storage)
    population_policy = PerRowsPopulationPolicy(LMDBReadWrite.params)
```
Collaborator

How and where do we write tests for populators with different needs in this model?

  • should they be part of the ASV class?
  • or should they live outside of it? Where? Any suggestions/ideas?

(in the previous version all tests were in environment_setup.py)

pass


```python
# Currently we're using the same arctic client for both persistent and modifiable libraries.
```
Collaborator

I would suggest having a separate class for these functions:

  • get_arctic_client
  • get_library_name
  • get_library
  • clear_all_modifiable_libs_from_this_process

Perhaps SharedStorageAccess, or a better name?
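A rough sketch of the grouping being suggested, assuming the four function names listed above; the class name `SharedStorageAccess` is the reviewer's tentative suggestion, and the method bodies are placeholders (the real ones would delegate to the Arctic client rather than return names).

```python
class SharedStorageAccess:
    """Hypothetical grouping of the client/library helpers into one object,
    so benchmarks get them from a single place."""

    def __init__(self, storage_uri: str):
        self.storage_uri = storage_uri
        # Track modifiable libraries created by this process so they can
        # be cleaned up without touching persistent ones.
        self._modifiable_libs = []

    def get_arctic_client(self):
        # Placeholder: real code would construct/cache an Arctic client
        # for self.storage_uri.
        ...

    def get_library_name(self, lib_type: str, benchmark_name: str) -> str:
        return f"{lib_type}_{benchmark_name}"

    def get_library(self, lib_type: str, benchmark_name: str):
        name = self.get_library_name(lib_type, benchmark_name)
        if lib_type == "MODIFIABLE":
            self._modifiable_libs.append(name)
        # Placeholder: real code would do ac.get_library(name, create_if_missing=True)
        return name

    def clear_all_modifiable_libs_from_this_process(self):
        cleared, self._modifiable_libs = self._modifiable_libs, []
        return cleared
```

Bundling these also gives the cleanup method a natural home, since the object already knows which modifiable libraries this process created.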

```python
# It is quite clear what this is responsible for: only dataframe generation.
# Using such an abstraction can help us deduplicate the dataframe generation code
# between the different `EnvironmentSetup`s.
# Note: We use a class instead of a generator function to allow caching of dataframes in the state.
class DataFrameGenerator(ABC):
```
Collaborator

Not necessarily opposed to using a class, but if caching is the only thing we need it for, we can use something like this instead.
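The reviewer's concrete suggestion is not shown in this excerpt, so the following is only my assumption of the kind of alternative meant: if caching is the sole reason for the class, a plain module-level function wrapped in `functools.lru_cache` gives the same memoization without any class state.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def generate_dataframe(num_rows: int):
    # Expensive generation runs once per distinct num_rows; repeated calls
    # with the same argument return the cached object.
    # (A tuple stands in for a pd.DataFrame to keep the sketch dependency-free;
    # note lru_cache requires hashable arguments.)
    return tuple(float(i) for i in range(num_rows))
```

The trade-off is that `lru_cache` keys only on the arguments, whereas a class can invalidate or parameterize its cache explicitly, which may matter if generation depends on mutable configuration.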
