
Optimizations of LinearModel.predict implementation #267

Closed · wants to merge 16 commits

Conversation

matt-graham (Collaborator)

Related to #207. Refactors the implementation of the LinearModel.predict method to remove some of the inefficiencies in the current implementation.

Below are the SnakeViz outputs for a profiling run (using the updated version of the scale_run.py script in this PR) with a population size of 20 000 and a simulation period of 2 years, using the current HEAD of master at the time of writing (cae6fb7). The first shows the overall breakdown and the second the subset of the call time spent in the LinearModel.predict function.

[SnakeViz output: overall profile breakdown]
[SnakeViz output: breakdown of time spent within LinearModel.predict]

Currently LinearModel.predict calls Predictor.predict for each of its predictors (lm.py:86(predict) block in the second plot above), with each (non-callback) predictor calling the eval method on the input dataframe for each condition in the predictor. Each eval call involves populating a dictionary of 'column resolvers' for each of the columns in the dataframe (generic.py:526(_get_cleaned_column_resolvers) block in the second plot above), with these then used to map from names to column values in the parsed expression. For dataframes with many columns, constructing this dictionary can be a substantial overhead in each eval call - in the profiling run 83% (236s / 284s) of the eval call time was spent constructing the column resolvers. Another 6% (16.3s / 284s) of the time is spent constructing a resolver for the dataframe index, which can be accessed via the name index in the expression passed to eval; however, this feature is not used in any of the current linear model implementations. The remaining 11% (31.5s / 284s) is spent on actually evaluating the expression, i.e. parsing it (8%, 21.4s / 284s) and executing it with the chosen engine (3%, 8.29s / 284s). This low percentage reflects the fact that the expressions for most individual predictor conditions are relatively simple.
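As a hedged, illustrative aside (synthetic data, not code from this PR or from TLOmodel), the overhead can be reproduced on a wide dataframe: each DataFrame.eval call pays the resolver construction cost for every column, so many small per-condition eval calls are dominated by that setup rather than by expression evaluation. Timings are indicative only.

import numpy as np
import pandas as pd
from timeit import timeit

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20_000, 200)),
                  columns=[f"col_{i}" for i in range(200)])

# Many small eval calls, one per condition, as in the current Predictor.predict
# loop; each call rebuilds resolvers covering all 200 columns
per_condition = timeit(lambda: [df.eval(f"col_{i} > 0.5") for i in range(10)], number=10)

# The same comparisons done directly on the columns, bypassing the per-call
# resolver construction entirely
direct = timeit(lambda: [df[f"col_{i}"] > 0.5 for i in range(10)], number=10)

print(f"per-condition eval: {per_condition:.3f}s, direct comparisons: {direct:.3f}s")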

Of the remaining time spent in LinearModel.predict, most is spent in pandas.Series.__setitem__ in Predictor.predict (46.2s) and pandas.DataFrame.__setitem__ in LinearModel.predict (24.4s), corresponding respectively, I believe, to the lines

TLOmodel/src/tlo/lm.py

Lines 121 to 122 in cae6fb7

# update elements in the output series with the corresponding value for the condition
output[mask] = value

and

TLOmodel/src/tlo/lm.py

Lines 232 to 233 in cae6fb7

for predictor in self.predictors:
res_by_predictor[predictor] = predictor.predict(df)

i.e. assigning the values for the conditions matched so far in the predictor to the series recording the output, and writing these output series back to the per-predictor results dataframe.

As mentioned in #207, the code also currently creates a new dataframe when calling df.assign in

TLOmodel/src/tlo/lm.py

Lines 210 to 215 in cae6fb7

# addition external variables used in the model but not part of the population dataframe
if kwargs:
new_columns = {}
for column_name, value in kwargs.items():
new_columns[f'__{column_name}__'] = kwargs[column_name]
df = df.assign(**new_columns)

when there are external variables specified in the model, though in practice this seems to be only a small overhead (0.012s cumulatively spent in DataFrame.assign in the profiling run).

This PR proposes several related changes to try to reduce some of the overheads described above:

  1. Rather than calling DataFrame.eval for each condition of each predictor in a model, a single expression string is built up corresponding to the model output for all the (non-callback) model predictors, and this expression string is then evaluated in a single eval call. This means much more of the eval call time is spent in parsing and evaluating the passed expressions rather than in repeatedly building the resolver dictionaries. Further, the expression string can be built once upon initialisation of the model and then reused for all predict calls. I had also hoped this could give some efficiency gain when numexpr is available, as the Pandas documentation on using eval to improve performance suggests there can be substantial gains from using eval with the numexpr engine on dataframes with more than 10 000 rows and that

    The larger the frame and the larger the expression the more speedup you will see from using eval().

    In practice the profiling results (see below) suggest that there isn't currently any gain from using numexpr. I believe this is largely because the expression strings cannot be evaluated with numexpr in many cases, due to the use of numexpr-incompatible syntax in some of the predictors in many models (for example, column methods such as between), with the code currently falling back to the python engine in these cases. As the expression ends up being parsed twice when numexpr fails to evaluate it, the overhead from this outweighs any gains in performance when numexpr is successfully used. While there is therefore currently no advantage to using numexpr, the refactoring in this PR would potentially make future performance gains possible if more of the linear models parsed to expression strings with numexpr-compatible syntax, either via more intelligent construction of the expression string or by defining some additional helper methods on the predictors corresponding to the most common numexpr-incompatible syntax.

  2. The column resolvers are manually constructed in each LinearModel.predict call, iterating over only the dataframe columns corresponding to names used in the predictors rather than over all columns. The external variables are also added directly to the column resolvers rather than being assigned to a new dataframe, although as noted above this does not seem to be a significant overhead in practice anyway. Currently no index resolvers are constructed, as the index name is not used in any downstream code, but this would be easy to add if required. A hedged sketch combining changes 1 and 2 is shown below this list.
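The sketch below is illustrative only: the function and variable names are made up and it is not the code in this PR, but it shows the shape of changes 1 and 2 for an additive model. The expression string is built once from each predictor's (condition, value) pairs, and each predict call then makes a single pandas.eval call whose resolvers cover only the referenced columns plus any external '__name__' variables.

import pandas as pd


def build_expression(intercept, predictors):
    # predictors: list of predictors, each a list of (condition, value) pairs;
    # earlier conditions within a predictor are negated so that only the first
    # matching condition contributes its value
    terms = [repr(intercept)]
    for conditions in predictors:
        previous = []
        for condition, value in conditions:
            guards = [f"({condition})"] + [f"~({c})" for c in previous]
            terms.append(f"({' & '.join(guards)}) * {value}")
            previous.append(condition)
    return " + ".join(terms)


def predict(expression, column_names, df, **external_variables):
    # resolvers are built by hand for just the columns named in the predictors;
    # external variables are injected directly rather than via DataFrame.assign
    resolvers = {name: df[name] for name in column_names}
    resolvers.update({f"__{name}__": value for name, value in external_variables.items()})
    return pd.eval(expression, resolvers=(resolvers,), engine="python")


df = pd.DataFrame({"age": [10, 40, 70], "is_urban": [True, False, True]})
expression = build_expression(
    1.0, [[("age < 18", 0.5), ("age >= 65", 2.0)], [("is_urban", 1.5)]]
)
print(predict(expression, ["age", "is_urban"], df))

Building mutually exclusive guard terms is what lets the whole model live in one expression, so the parse cost is paid once per predict call rather than once per condition.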

Some additional checks are also added to the tests in tests/test_lm.py to cover edge cases encountered while implementing these changes: specifically, ensuring that models with only an intercept and no predictors output correct predictions, and that valid outputs are given for dataframes with columns of categorical datatype with integer categories, as Pandas' treatment of such columns in eval is quite brittle. Some of the checks involving comparisons of floating point values are also relaxed to allow for differences due to accumulated floating point error when e.g. doing reduction operations in a different order.

With the changes in this PR, the SnakeViz outputs for a profiling run equivalent to that shown above (2 years / 20 000 population), without numexpr installed in the running environment, are as follows

[SnakeViz output: overall profile breakdown with this PR's changes, without numexpr]
[SnakeViz output: breakdown of time spent within LinearModel.predict with this PR's changes, without numexpr]

The time spent in LinearModel.predict is 27% of its previous value (93.2s / 348s), and the total runtime is 88% of its previous value (1927s / 2181s). Looking at the breakdown of the time spent in LinearModel.predict in the second plot, we can see that 93% (87.1s / 93.2s) is spent in parsing and evaluating the expression, with the overhead from constructing the resolvers etc. now minimal.

Equivalent SnakeViz outputs for a run with numexpr installed in the running environment are as follows

[SnakeViz output: overall profile breakdown with this PR's changes, with numexpr]
[SnakeViz output: breakdown of time spent within LinearModel.predict with this PR's changes, with numexpr]

There is still a net gain in performance, albeit a smaller one than without numexpr. Significantly more time is spent in parsing the expressions when calling eval (78.6s vs 58.6s), potentially due to the expressions being parsed twice for models in which evaluating with numexpr initially fails.

@tamuri (Collaborator) commented May 17, 2021

Looks good! As discussed:

  1. Can remove the numexpr try/catch.
  2. As a sanity check, perform a long sim without your changes & save the population dataframe at the end. Then run again with your changes (and the same seed) - the population should be identical.

@matt-graham (Collaborator, Author)

I've pushed a commit which changes the code to always use engine="python" in the eval call rather than first trying engine="numexpr". I have also added a commit which makes the check for whether names in the predictor(s) correspond to external variables more robust, by only checking the keyword argument dictionary when the name is prefixed and suffixed by double underscores, rather than just stripping the double underscores and checking (which gives false positives when a name without surrounding underscores is present in the keyword arguments).

With regard to point (2), for an initial test with a simulation with a population size of 10 000 over two years (starting with a short run as an initial check), I get that

(((df_after == df_before) | (df_after.isna() & df_before.isna()))[:init_pop_size]).all().all()  == True

where df_after is the final population dataframe generated by a run using the code in this PR at commit 3792fe3, df_before is the corresponding final population dataframe generated by a run using the commit (cae6fb7) on master from which the PR branch originated, and init_pop_size is the initial population size (10 000).

The final 400 rows of the two dataframes do not exactly match, however; specifically, the values in the columns 'nb_low_birth_weight_status', 'nb_size_for_gestational_age', and 'nb_kangaroo_mother_care' differ. The former two seem to be initialised based on binning a birth_weight value randomly sampled from a normal distribution, and nb_kangaroo_mother_care appears to be computed partially based on the value of nb_low_birth_weight_status.

I initially thought that the discrepancies in these values before and after the changes in this PR were probably due to the random number generator states differing at the point of the births which generate the rows beyond the initial population size. However, after rerunning the simulations with the final random number generator states saved, the states of the simulation and module-level RNGs match exactly at the end of the simulation runs using the code before and after the changes in this PR. It's still possible that the states are out of sync at some intermediate point in the simulation before getting back into sync by the end, but that feels quite implausible. Hence it seems that something else may be at play.

The exact floating point output of LinearModel.predict depends on the order in which the predictor values are accumulated (as, in general, floating point addition and multiplication are not associative). The order of accumulation is changed in this PR compared to master, as (i) the intercept is accumulated after any non-callback predictors but before any callback predictors, and (ii) any callback predictors are accumulated after all the non-callback predictors, rather than predictors being accumulated in the exact order passed to the LinearModel initializer. Other than via an effect on the random number generator states, I cannot see how these very small discrepancies in the predict outputs would lead to the differences observed in the final simulation output, but I will try to do some more investigation to see if this might be the cause.
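As a quick reminder of the non-associativity mentioned above (standard floating point behaviour, nothing specific to this PR or TLOmodel):

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False
print((a + b) + c, a + (b + c))    # 0.6000000000000001 0.6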

@tamuri (Collaborator) commented May 17, 2021

Check that the RandomState object (self.rng) for the module is being passed to all methods used for sampling. They should use self.rng.xyz for numpy random sampling. For scipy distributions, you need to pass the argument random_state=self.rng in the call. I'll check this.
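For illustration, a hedged sketch of the pattern being described (the class and method names here are made up, not actual TLOmodel code): all sampling should go through the module's seeded RandomState rather than the global numpy random state, and scipy distributions should be handed the same generator via random_state.

import numpy as np
from scipy import stats


class ExampleModule:
    def __init__(self, seed):
        # seeded generator owned by the module, analogous to self.rng
        self.rng = np.random.RandomState(seed)

    def sample_birth_weight(self, mean, std):
        # numpy sampling: use the module's RandomState, not np.random.normal
        return self.rng.normal(loc=mean, scale=std)

    def sample_birth_weight_scipy(self, mean, std):
        # scipy sampling: pass the generator explicitly via random_state
        return stats.norm.rvs(loc=mean, scale=std, random_state=self.rng)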

@tamuri (Collaborator) commented May 17, 2021

Here's one bug, need to fix this:

birth_weight = np.random.normal(loc=params['mean_birth_weights'][mean_birth_weight_list_location],

should be

birth_weight = self.rng.normal(loc=...

@tamuri (Collaborator) commented May 17, 2021

Great work, btw!

@matt-graham (Collaborator, Author)

Here's one bug, need to fix this:

birth_weight = np.random.normal(loc=params['mean_birth_weights'][mean_birth_weight_list_location],

should be

birth_weight = self.rng.normal(loc=...

Thanks - despite looking at that bit of code on and off all day I managed to miss that it was using np.random rather than the seeded RandomState 😅. I will rerun the before and after comparison of the output tomorrow to verify that this removes the discrepancy, but given that birth_weight is used in calculating all of the columns in which the values differed, this seems likely to be the cause!

@matt-graham (Collaborator, Author)

I've now merged in @joehcollins's fix for #274. I have also added a check that the intercept passed to the model initializer is finite, which resolves #262. The changes to the eval call also resolve #264. The final population dataframe and RNG states for simulations of population 10 000 for 2 years with the latest commit here (72c0bcd) and the tip of @joehcollins's jcollins_rng_fix_may2021 branch (6c9b2bd) now match exactly. @tamuri shall I do longer runs to check that the outputs match, or is what I've done so far sufficient?

@tamuri (Collaborator) commented May 18, 2021

Thanks, Matt - that's great. You don't have to do longer runs. I have to check why CI isn't running on your PR - we might have disabled it for fork PRs. Will check, review this PR and merge.

@matt-graham (Collaborator, Author)

I have submitted a new PR #275 from a duplicate, in the main repository, of the fork branch this PR was based on, to allow GitHub Actions checks to run. I am therefore closing this PR now to avoid confusion.
