-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ASV framework refactor proposal #2227
base: asv_s3_more
Are you sure you want to change the base?
Conversation
This commit adds some brief code example of how I think we can simplify the current framework. The main idea is that currently the `EnvConfigurationBase` is responsible for too many things at the same time: - dataframe generation - library creation/population - differentiating between persistant/modifiable libraries. - managing different libraries needed for a benchmark It being responsible for a large bundle of things makes it hard to: - reuse between different benchmarks (we basically end up with one new child of `EnvConfigurationBase` for each benchmark) - understand what some method actually does and we need to rely on lenghty documentation which is harder to keep up to date My proposal is not code complete but show how we might go about simplifying this framework. New code is inside the `environment_setup_refactored.py` and show how we might go about replacing the current `environment_setup.py`. I've also tweaked the `LMDBReadWrite` benchmark to show how we can use the new framework. The proposal mainly consists of separating the four responsibilities of the `EnvConfigurationBase` - dataframe generation (now we have an interface `DataFrameGenerator`) - library population (now we have an interface `LibraryPopulationPolicy`) - persistant/modifiable (now we just have function which can give you what kind of library you need) - managing different libraries (now this is entirely responsibility of the benchmark as each benchmark might need completely different logic around the number/types of libraries it needs. E.g. read_write uses one persistent and one modifiable) By separating these we can get better reusablity: - E.g. we can mix and match policies with dataframe generators for different benchmarks Also now when I see the code in the benchmark I can how the libraries are created/populated instead of an opaque `setup` call which is unclear what it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that both versions are ok.
The original version main goal was to separate the creation structure from the ASV class. That is something that needed to be done because of the persostence library. Only this way the code for generating the environment could be safely tested without making changes to the tests - you fix the logic of generating the structure of objects outside of ASV tests then you go to ASV tests and use it wherever you want. So the class although monolythic has this main function. That makes the setup code llittle more hard to understand but easy to reuse outside of ASV as it does not follow ASV guidelines. But the benchamrk tests are mainly because of the tests not about the setup struct, so once developed the structure would be unwise to change due to loss of history for benchamrks.
The proposed version indeed makes at first glance perhaps things more clear during the review process as the code will be inside the ASV test, primarily. Having more decomposed objects perhaps would increase reusability. Still there are things that are missing from the model and the final solution would be more thick and perhaps again not quite clean here and there when missing things are added.
Thus for me it is ok to go to new way of constucting tests, just have to say that during adaptations of this approach to different benchamrks its cleanes might suffer.
|
||
class LibraryType(Enum): | ||
PERSISTENT = "PERMANENT" | ||
MODIFIABLE = "MODIFIABLE" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would advise adding the concept of TEST library type which is persistent space for tests. Setting then throgh a property of the generator class set_test() could automatically make the code execute on separate space and not polute PERSISTENT space.
# Using a separate abstraction for the population policy can allow us to be flexible with how we populate libraries. | ||
# E.g. One benchmark can use one population policy for read operations, a different for updates and a third for appends | ||
# Note: Using a kwargs generator to be passed to the df_generator allows customization of dataframe generation params (e.g. num_rows, num_cols, use_string_columns?) | ||
class LibraryPopulationPolicy(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed with Ivo that in previous version the setup environment class was responsible for both library and symbol generation. In this version it will be responsible only for symbol population. Then ASV test will invoke it for each library it needs to populate.
Perhaps this would be ammended also with other methods, attributes:
- number of rows
- number of cols
- number version
- number of snapshots
- etc.
I would suggest that populate_library
and populate_persistent_library_if_missing
are added to this class also.
I do not beleive we will have single only way to populate_library
... Or if we have single way it will be only that way. And then if we need more way to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I aggree as we discussed offline, we need more complex LibraryPopulationPolicy
, to accomodate for many versions/many snapshots.
I'd be happy to move the populate_library
as part of the class, although I want to avoid the sitation where we end up with a complex inheritance structure of these population policies.
Probably just a single Policy which allows many symbols / many versions / snapshot frequency can serve for most our tests
|
||
def time_read_with_column_float(self, storage_info, num_rows): | ||
ac = get_arctic_client(LMDBReadWrite.storage) | ||
population_policy = PerRowsPopulationPolicy(LMDBReadWrite.params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How and where we write tests in this mode for populators for differenet needs?
- should it be part of the ASV class?
- or should it be outside of it? Where? Any suggestions/ideas?
(in previous option all tests were in environment_setup.py)
pass | ||
|
||
|
||
# Currently we're using the same arctic client for both persistant and modifiable libraries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest having a separate class for those 3 functions :
get_arctic_client
get_library_name
get_library
clear_all_modifiable_libs_from_this_process
Perhaps SharedStorageAccess or better name?
# It is quite clear what is this responsible for: only dataframe generation | ||
# Using such an abstraction can help us deduplicate the dataframe generation code between the different `EnvironmentSetup`s | ||
# Note: We use a class instead of a generator function to allow caching of dataframes in the state | ||
class DataFrameGenerator(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily opposed to using a class, but if caching is the only thing we need it for, we can use something like this instead.
This commit adds some brief code example of how I think we can simplify the current framework.
The main idea is that currently the
EnvConfigurationBase
is responsible for too many things at the same time:It being responsible for a large bundle of things makes it hard to:
EnvConfigurationBase
for each benchmark)My proposal is not code complete but show how we might go about simplifying this framework. New code is inside the
environment_setup_refactored.py
and show how we might go about replacing the currentenvironment_setup.py
.I've also tweaked the
LMDBReadWrite
benchmark to show how we can use the new framework.The proposal mainly consists of separating the four responsibilities of the
EnvConfigurationBase
DataFrameGenerator
)LibraryPopulationPolicy
)By separating these we can get better reusablity:
Also now when I see the code in the benchmark I can how the libraries are created/populated instead of an opaque
setup
call which is unclear what it does.I have left various inline comments explaining why I think the new approach is better/clearer. Any comments welcome.