
ASV framework refactor proposal #2227

Draft: wants to merge 1 commit into base branch asv_s3_more
Conversation

IvoDD
Collaborator

@IvoDD IvoDD commented Mar 10, 2025

This commit adds a brief code example of how I think we can simplify the current framework.

The main idea is that currently the `EnvConfigurationBase` is responsible for too many things at the same time:

  • dataframe generation
  • library creation/population
  • differentiating between persistent/modifiable libraries
  • managing the different libraries needed for a benchmark

Being responsible for such a large bundle of things makes it hard to:

  • reuse it between different benchmarks (we basically end up with one new child of `EnvConfigurationBase` per benchmark)
  • understand what a given method actually does, so we have to rely on lengthy documentation, which is harder to keep up to date

My proposal is not code complete, but it shows how we might go about simplifying this framework. The new code is in `environment_setup_refactored.py` and shows how we might replace the current `environment_setup.py`.

I've also tweaked the `LMDBReadWrite` benchmark to show how we can use the new framework.

The proposal mainly consists of separating the four responsibilities of the `EnvConfigurationBase`:

  • dataframe generation (now we have an interface `DataFrameGenerator`)
  • library population (now we have an interface `LibraryPopulationPolicy`)
  • persistent/modifiable (now we just have a function which can give you whichever kind of library you need)
  • managing different libraries (now this is entirely the responsibility of the benchmark, as each benchmark might need completely different logic around the number/types of libraries it uses; e.g. read_write uses one persistent and one modifiable library)

By separating these we get better reusability:

  • e.g. we can mix and match population policies with dataframe generators across different benchmarks

Also, when I now look at the code in a benchmark I can see how the libraries are created/populated, instead of an opaque `setup` call whose behavior is unclear.

I have left various inline comments explaining why I think the new approach is better/clearer. Any comments welcome.
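To make the proposed split concrete, here is a minimal sketch of how the separated pieces could fit together. Only the names `LibraryType`, `DataFrameGenerator`, and `LibraryPopulationPolicy` come from the proposal itself; every method name and the concrete classes below are my own hypothetical placeholders, and a plain dict of columns stands in for a `pd.DataFrame` to keep the sketch dependency-free.

```python
from abc import ABC, abstractmethod
from enum import Enum


class LibraryType(Enum):
    PERSISTENT = "PERMANENT"
    MODIFIABLE = "MODIFIABLE"


class DataFrameGenerator(ABC):
    """Single responsibility: generate benchmark data.

    The real framework would return a pd.DataFrame; a dict of columns is
    used here so the sketch has no third-party dependencies.
    """

    @abstractmethod
    def get_dataframe(self, num_rows: int):
        ...


class FloatColumnGenerator(DataFrameGenerator):
    """Hypothetical generator producing a single float column."""

    def get_dataframe(self, num_rows: int):
        return {"value": [float(i) for i in range(num_rows)]}


class LibraryPopulationPolicy(ABC):
    """Single responsibility: decide which symbols to write and with what data."""

    def __init__(self, df_generator: DataFrameGenerator):
        self.df_generator = df_generator

    @abstractmethod
    def populate(self, lib) -> None:
        ...


class SingleSymbolPopulation(LibraryPopulationPolicy):
    """Hypothetical policy writing one symbol of a given size."""

    def __init__(self, df_generator: DataFrameGenerator, num_rows: int):
        super().__init__(df_generator)
        self.num_rows = num_rows

    def populate(self, lib) -> None:
        # `lib` is whatever the library-getter function returned;
        # the benchmark decides which library type it wants.
        lib.write("sym", self.df_generator.get_dataframe(self.num_rows))
```

With this shape, a benchmark wires a generator to a policy and applies the policy to whichever libraries it manages, which is the "mix and match" reuse described above.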

Collaborator

I think that both versions are ok.

The main goal of the original version was to separate the creation structure from the ASV class. That needed to be done because of the persistence library: only this way could the code for generating the environment be safely tested without making changes to the tests. You fix the logic for generating the structure of objects outside of the ASV tests, then you go to the ASV tests and use it wherever you want. So the class, although monolithic, has this main function. That makes the setup code a little harder to understand, but easy to reuse outside of ASV, as it does not follow ASV guidelines. But the benchmarks exist mainly for the tests, not for the setup structure, so once developed it would be unwise to change the structure, due to loss of history for the benchmarks.

The proposed version indeed makes things clearer at first glance during the review process, as the code will live primarily inside the ASV test. Having more decomposed objects would perhaps increase reusability. Still, there are things missing from the model, and the final solution will be thicker and perhaps again not quite clean here and there once the missing pieces are added.

Thus for me it is ok to move to the new way of constructing tests; I just have to say that as this approach is adapted to different benchmarks, its cleanness might suffer.


```python
class LibraryType(Enum):
    PERSISTENT = "PERMANENT"
    MODIFIABLE = "MODIFIABLE"
```
Collaborator

I would advise adding the concept of a TEST library type, which is a persistent space for tests. Setting it through a property of the generator class, e.g. set_test(), could automatically make the code execute in a separate space and not pollute the PERSISTENT space.
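One way the suggested TEST type could slot into the enum is sketched below. The `TEST` member and the `get_library_name` helper are my own hypothetical additions illustrating the idea of routing test runs to their own namespace; the two existing members are from the diff above.

```python
from enum import Enum


class LibraryType(Enum):
    PERSISTENT = "PERMANENT"
    MODIFIABLE = "MODIFIABLE"
    TEST = "TEST"  # hypothetical: persistent space reserved for tests


def get_library_name(lib_type: LibraryType, benchmark_name: str) -> str:
    # Prefixing library names with the type keeps TEST runs from
    # polluting the PERSISTENT space: they land under a separate prefix.
    return f"{lib_type.value}_{benchmark_name}"
```

For example, `get_library_name(LibraryType.TEST, "read_write")` yields `"TEST_read_write"`, while the persistent variant yields `"PERMANENT_read_write"`, so the two spaces never collide.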

```python
# Using a separate abstraction for the population policy can allow us to be flexible
# with how we populate libraries.
# E.g. one benchmark can use one population policy for read operations, a different
# one for updates, and a third for appends.
# Note: Using a kwargs generator to be passed to the df_generator allows customization
# of dataframe generation params (e.g. num_rows, num_cols, use_string_columns?)
class LibraryPopulationPolicy(ABC):
```
Collaborator

We discussed with Ivo that in the previous version the setup environment class was responsible for both library and symbol generation. In this version it will be responsible only for symbol population; the ASV test will then invoke it for each library it needs to populate.

Perhaps this could also be amended with other methods and attributes:

  • number of rows
  • number of columns
  • number of versions
  • number of snapshots
  • etc.

I would suggest that populate_library and populate_persistent_library_if_missing are added to this class as well.

I do not believe we will have only a single way to populate_library. And if we start with a single way, it will be the only way; what happens when we need more ways to do it?

Collaborator Author

I agree, as we discussed offline, that we need a more complex LibraryPopulationPolicy to accommodate many versions/many snapshots.

I'd be happy to move populate_library into the class, although I want to avoid the situation where we end up with a complex inheritance structure of population policies.

Probably just a single policy which allows many symbols / many versions / a snapshot frequency can serve most of our tests.
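A single configurable policy of the kind discussed above might look roughly like this. All names and parameters here are hypothetical; the point is that symbols, versions, and snapshot frequency become plain constructor arguments instead of an inheritance tree of policy subclasses.

```python
class GeneralPopulationPolicy:
    """Hypothetical single policy covering many symbols, many versions per
    symbol, and an optional snapshot frequency."""

    def __init__(self, num_symbols: int, versions_per_symbol: int = 1,
                 snapshot_every_n_versions: int = 0):
        self.num_symbols = num_symbols
        self.versions_per_symbol = versions_per_symbol
        # 0 means "never snapshot"
        self.snapshot_every_n_versions = snapshot_every_n_versions

    def populate(self, lib, df_generator):
        # `df_generator` is any callable producing the data for one version;
        # `lib` is assumed to expose write()/snapshot() like an Arctic library.
        for sym_idx in range(self.num_symbols):
            symbol = f"sym_{sym_idx}"
            for version in range(self.versions_per_symbol):
                lib.write(symbol, df_generator(version))
                if (self.snapshot_every_n_versions
                        and (version + 1) % self.snapshot_every_n_versions == 0):
                    lib.snapshot(f"snap_{symbol}_{version}")
```

For example, `GeneralPopulationPolicy(2, 4, 2)` would write 2 symbols with 4 versions each and snapshot after every second version, without any new subclass being defined.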


```python
def time_read_with_column_float(self, storage_info, num_rows):
    ac = get_arctic_client(LMDBReadWrite.storage)
    population_policy = PerRowsPopulationPolicy(LMDBReadWrite.params)
```
Collaborator

How and where do we write tests for populators with different needs in this model?

  • should they be part of the ASV class?
  • or should they live outside of it? Where? Any suggestions/ideas?

(in the previous version all tests were in environment_setup.py)

pass


```python
# Currently we're using the same arctic client for both persistent and modifiable libraries.
```
Collaborator

I would suggest having a separate class for these functions:

  • get_arctic_client
  • get_library_name
  • get_library
  • clear_all_modifiable_libs_from_this_process

Perhaps SharedStorageAccess, or a better name?
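A rough sketch of the grouping being suggested, assuming the four function names listed above; the class name `SharedStorageAccess` is the reviewer's tentative suggestion, and the method bodies are placeholders (the real ones would delegate to the Arctic client rather than return names).

```python
class SharedStorageAccess:
    """Hypothetical grouping of the client/library helpers into one object,
    so benchmarks get them from a single place."""

    def __init__(self, storage_uri: str):
        self.storage_uri = storage_uri
        # Track modifiable libraries created by this process so they can
        # be cleaned up without touching persistent ones.
        self._modifiable_libs = []

    def get_arctic_client(self):
        # Placeholder: real code would construct/cache an Arctic client
        # for self.storage_uri.
        ...

    def get_library_name(self, lib_type: str, benchmark_name: str) -> str:
        return f"{lib_type}_{benchmark_name}"

    def get_library(self, lib_type: str, benchmark_name: str):
        name = self.get_library_name(lib_type, benchmark_name)
        if lib_type == "MODIFIABLE":
            self._modifiable_libs.append(name)
        # Placeholder: real code would do ac.get_library(name, create_if_missing=True)
        return name

    def clear_all_modifiable_libs_from_this_process(self):
        cleared, self._modifiable_libs = self._modifiable_libs, []
        return cleared
```

Bundling these also gives the cleanup method a natural home, since the object already knows which modifiable libraries this process created.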

```python
# It is quite clear what this is responsible for: only dataframe generation.
# Using such an abstraction can help us deduplicate the dataframe generation code
# between the different `EnvironmentSetup`s.
# Note: We use a class instead of a generator function to allow caching of dataframes in the state.
class DataFrameGenerator(ABC):
```
Collaborator

Not necessarily opposed to using a class, but if caching is the only thing we need it for, we can use something like this instead.
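The reviewer's concrete suggestion is not shown in this excerpt, so the following is only my assumption of the kind of alternative meant: if caching is the sole reason for the class, a plain module-level function wrapped in `functools.lru_cache` gives the same memoization without any class state.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def generate_dataframe(num_rows: int):
    # Expensive generation runs once per distinct num_rows; repeated calls
    # with the same argument return the cached object.
    # (A tuple stands in for a pd.DataFrame to keep the sketch dependency-free;
    # note lru_cache requires hashable arguments.)
    return tuple(float(i) for i in range(num_rows))
```

The trade-off is that `lru_cache` keys only on the arguments, whereas a class can invalidate or parameterize its cache explicitly, which may matter if generation depends on mutable configuration.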
