
Optimizations of LinearModel.predict implementation #267

Closed · wants to merge 16 commits

Conversation

matt-graham (Collaborator)

Related to #207. Refactors the implementation of the LinearModel.predict method to remove some of the inefficiencies in the current implementation.

Below are the SnakeViz outputs for a profiling run (using the updated version of the scale_run.py script in this PR) with a population size of 20 000 and a simulation period of 2 years, using the current HEAD of master at the time of writing (cae6fb7). The first shows the overall breakdown and the second the subset of the call time spent in the LinearModel.predict function.

[SnakeViz output: overall profile breakdown]
[SnakeViz output: breakdown of time spent within LinearModel.predict]

Currently LinearModel.predict calls Predictor.predict for each of its predictors (lm.py:86(predict) block in the second plot above), with each (non-callback) predictor calling the eval method on the input dataframe for each condition in the predictor. Each eval call involves populating a dictionary of 'column resolvers' for each of the columns in the dataframe (generic.py:526(_get_cleaned_column_resolvers) block in the second plot above), with these then used to map from names to column values in the parsed expression. For dataframes with many columns, constructing this dictionary can be a substantial overhead in each eval call - in the profiling run 83% (236s / 284s) of the eval call time was spent constructing the column resolvers. Another 6% (16.3s / 284s) of the time is spent constructing a resolver for the dataframe index, which can be accessed via the name index in the expression passed to eval; however, this feature is not used in any of the current linear model implementations. The remaining 11% (31.5s / 284s) is spent on actually evaluating the expression, i.e. parsing it (8%, 21.4s / 284s) and executing it with the chosen engine (3%, 8.29s / 284s). This low percentage reflects the fact that the expressions for most individual predictor conditions are relatively simple.
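As a hedged, illustrative aside (synthetic data, not code from this PR or from TLOmodel), the overhead can be reproduced on a wide dataframe: each DataFrame.eval call pays the resolver construction cost for every column, so many small per-condition eval calls are dominated by that setup rather than by expression evaluation. Timings are indicative only.

import numpy as np
import pandas as pd
from timeit import timeit

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20_000, 200)),
                  columns=[f"col_{i}" for i in range(200)])

# Many small eval calls, one per condition, as in the current Predictor.predict
# loop; each call rebuilds resolvers covering all 200 columns
per_condition = timeit(lambda: [df.eval(f"col_{i} > 0.5") for i in range(10)], number=10)

# The same comparisons done directly on the columns, bypassing the per-call
# resolver construction entirely
direct = timeit(lambda: [df[f"col_{i}"] > 0.5 for i in range(10)], number=10)

print(f"per-condition eval: {per_condition:.3f}s, direct comparisons: {direct:.3f}s")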

Of the remaining time spent in LinearModel.predict, most is spent in pandas.Series.__setitem__ in Predictor.predict (46.2s) and pandas.DataFrame.__setitem__ in LinearModel.predict (24.4s), corresponding respectively, I believe, to the lines

TLOmodel/src/tlo/lm.py

Lines 121 to 122 in cae6fb7

# update elements in the output series with the corresponding value for the condition
output[mask] = value

and

TLOmodel/src/tlo/lm.py

Lines 232 to 233 in cae6fb7

for predictor in self.predictors:
res_by_predictor[predictor] = predictor.predict(df)

i.e. assigning the values for the conditions matched so far in the predictor to the series recording the output, and writing these output series back to the per-predictor results dataframe.

As mentioned in #207, the code also currently creates a new dataframe when calling df.assign in

TLOmodel/src/tlo/lm.py

Lines 210 to 215 in cae6fb7

# addition external variables used in the model but not part of the population dataframe
if kwargs:
new_columns = {}
for column_name, value in kwargs.items():
new_columns[f'__{column_name}__'] = kwargs[column_name]
df = df.assign(**new_columns)

when there are external variables specified in the model, though in practice this seems to be only a small overhead (0.012s cumulatively spent in DataFrame.assign in the profiling run).

This PR proposes several related changes to try to reduce some of the overheads described above:

  1. Rather than calling DataFrame.eval for each condition of each predictor in a model, a single expression string is built up corresponding to the model output for all the (non-callback) model predictors, and this expression string is then evaluated in a single eval call. This means much more of the eval call time is spent in parsing and evaluating the passed expressions rather than in repeatedly building the resolver dictionaries. Further, the expression string can be built once upon initialisation of the model and then reused for all predict calls. I had also hoped this could give some efficiency gain when numexpr is available, as the Pandas documentation on using eval to improve performance suggests there can be substantial gains from using eval with the numexpr engine on dataframes with more than 10 000 rows and that

    The larger the frame and the larger the expression the more speedup you will see from using eval().

    In practice the profiling results (see below) suggest that there isn't currently any gain from using numexpr. I believe this is largely because the expression strings cannot be evaluated with numexpr in many cases, due to the use of numexpr-incompatible syntax in some of the predictors in many models (for example, column methods such as between), with the code currently falling back to the python engine in these cases. As the expression ends up being parsed twice when numexpr fails to evaluate it, the overhead from this outweighs any gains in performance when numexpr is successfully used. While there is therefore currently no advantage to using numexpr, the refactoring in this PR would potentially make future performance gains possible if more of the linear models parsed to expression strings with numexpr-compatible syntax, either via more intelligent construction of the expression string or by defining some additional helper methods on the predictors corresponding to the most common numexpr-incompatible syntax.

  2. The column resolvers are manually constructed in each LinearModel.predict call, iterating over only the dataframe columns corresponding to names used in the predictors rather than over all columns. The external variables are also added directly to the column resolvers rather than being assigned to a new dataframe, although as noted above this does not seem to be a significant overhead in practice anyway. Currently no index resolvers are constructed, as the index name is not used in any downstream code, but this would be easy to add if required. A hedged sketch combining changes 1 and 2 is shown below this list.
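The sketch below is illustrative only: the function and variable names are made up and it is not the code in this PR, but it shows the shape of changes 1 and 2 for an additive model. The expression string is built once from each predictor's (condition, value) pairs, and each predict call then makes a single pandas.eval call whose resolvers cover only the referenced columns plus any external '__name__' variables.

import pandas as pd


def build_expression(intercept, predictors):
    # predictors: list of predictors, each a list of (condition, value) pairs;
    # earlier conditions within a predictor are negated so that only the first
    # matching condition contributes its value
    terms = [repr(intercept)]
    for conditions in predictors:
        previous = []
        for condition, value in conditions:
            guards = [f"({condition})"] + [f"~({c})" for c in previous]
            terms.append(f"({' & '.join(guards)}) * {value}")
            previous.append(condition)
    return " + ".join(terms)


def predict(expression, column_names, df, **external_variables):
    # resolvers are built by hand for just the columns named in the predictors;
    # external variables are injected directly rather than via DataFrame.assign
    resolvers = {name: df[name] for name in column_names}
    resolvers.update({f"__{name}__": value for name, value in external_variables.items()})
    return pd.eval(expression, resolvers=(resolvers,), engine="python")


df = pd.DataFrame({"age": [10, 40, 70], "is_urban": [True, False, True]})
expression = build_expression(
    1.0, [[("age < 18", 0.5), ("age >= 65", 2.0)], [("is_urban", 1.5)]]
)
print(predict(expression, ["age", "is_urban"], df))

Building mutually exclusive guard terms is what lets the whole model live in one expression, so the parse cost is paid once per predict call rather than once per condition.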

Some additional checks are also added to the tests in tests/test_lm.py to cover edge cases encountered while implementing these changes: specifically, ensuring that models with only an intercept and no predictors output correct predictions, and that valid outputs are given for dataframes with columns of categorical datatype with integer categories, as Pandas' treatment of such columns in eval is quite brittle. Some of the checks involving comparisons of floating point values are also relaxed to allow for differences due to accumulated floating point error when e.g. doing reduction operations in a different order.

With the changes in this PR, the SnakeViz outputs for a profiling run equivalent to that shown above (2 years / 20 000 population), without numexpr installed in the running environment, are as follows

[SnakeViz output: overall profile breakdown with this PR's changes, without numexpr]
[SnakeViz output: breakdown of time spent within LinearModel.predict with this PR's changes, without numexpr]

The time spent in LinearModel.predict is 27% of its previous value (93.2s / 348s), and the total runtime is 88% of its previous value (1927s / 2181s). Looking at the breakdown of the time spent in LinearModel.predict in the second plot, we can see that 93% (87.1s / 93.2s) is spent in parsing and evaluating the expression, with the overhead from constructing the resolvers etc. now minimal.

Equivalent SnakeViz outputs for a run with numexpr installed in the running environment are as follows

[SnakeViz output: overall profile breakdown with this PR's changes, with numexpr]
[SnakeViz output: breakdown of time spent within LinearModel.predict with this PR's changes, with numexpr]

There is still a net gain in performance, albeit a smaller one than without numexpr. Significantly more time is spent in parsing the expressions when calling eval (78.6s vs 58.6s), potentially due to the expressions being parsed twice for models in which evaluating with numexpr initially fails.

@tamuri (Collaborator) commented May 17, 2021

Looks good! As discussed:

  1. Can remove the numexpr try/catch.
  2. As a sanity check, perform a long sim without your changes & save the population dataframe at the end. Then run again with your changes (and the same seed) - the population should be identical.

@matt-graham (Collaborator, Author)

I've pushed a commit which changes the code to always use engine="python" in the eval call rather than first trying engine="numexpr". I have also added a commit which makes the check for whether names in the predictor(s) correspond to external variables more robust, by only checking the keyword argument dictionary when the name is prefixed and suffixed by double underscores, rather than just stripping the double underscores and checking (which gives false positives when a name without surrounding underscores is present in the keyword arguments).

With regard to point (2), for an initial test with a simulation with a population size of 10 000 over two years (starting with a short run as an initial check), I get that

(((df_after == df_before) | (df_after.isna() & df_before.isna()))[:init_pop_size]).all().all()  == True

where df_after is the final population dataframe generated by a run using the code in this PR at commit 3792fe3, df_before is the corresponding final population dataframe generated by a run using the commit (cae6fb7) on master from which the PR branch originated, and init_pop_size is the initial population size (10 000).

The final 400 rows of the two dataframes do not exactly match, however; specifically, the values in the columns 'nb_low_birth_weight_status', 'nb_size_for_gestational_age', and 'nb_kangaroo_mother_care' differ. The former two seem to be initialised based on binning a birth_weight value randomly sampled from a normal distribution, and nb_kangaroo_mother_care appears to be computed partially based on the value of nb_low_birth_weight_status.

I initially thought that the discrepancies in these values before and after the changes in this PR were probably due to the random number generator states differing at the point of the births which generate the rows beyond the initial population size. However, after rerunning the simulations with the final random number generator states saved, the states of the simulation and module-level RNGs match exactly at the end of the simulation runs using the code before and after the changes in this PR. It's still possible that the states are out of sync at some intermediate point in the simulation before getting back into sync by the end, but that feels quite implausible. Hence it seems that something else may be at play.

The exact floating point output of LinearModel.predict depends on the order in which the predictor values are accumulated (as, in general, floating point addition and multiplication are not associative). The order of accumulation is changed in this PR compared to master, as (i) the intercept is accumulated after any non-callback predictors but before any callback predictors, and (ii) any callback predictors are accumulated after all the non-callback predictors, rather than predictors being accumulated in the exact order passed to the LinearModel initializer. Other than via an effect on the random number generator states, I cannot see how these very small discrepancies in the predict outputs would lead to the differences observed in the final simulation output, but I will try to do some more investigation to see if this might be the cause.
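As a quick reminder of the non-associativity mentioned above (standard floating point behaviour, nothing specific to this PR or TLOmodel):

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False
print((a + b) + c, a + (b + c))    # 0.6000000000000001 0.6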

@tamuri (Collaborator) commented May 17, 2021

Check that the RandomState object (self.rng) for the module is being passed to all methods used for sampling. They should use self.rng.xyz for numpy random sampling. For scipy distributions, you need to pass the argument random_state=self.rng in the call. I'll check this.
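For illustration, a hedged sketch of the pattern being described (the class and method names here are made up, not actual TLOmodel code): all sampling should go through the module's seeded RandomState rather than the global numpy random state, and scipy distributions should be handed the same generator via random_state.

import numpy as np
from scipy import stats


class ExampleModule:
    def __init__(self, seed):
        # seeded generator owned by the module, analogous to self.rng
        self.rng = np.random.RandomState(seed)

    def sample_birth_weight(self, mean, std):
        # numpy sampling: use the module's RandomState, not np.random.normal
        return self.rng.normal(loc=mean, scale=std)

    def sample_birth_weight_scipy(self, mean, std):
        # scipy sampling: pass the generator explicitly via random_state
        return stats.norm.rvs(loc=mean, scale=std, random_state=self.rng)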

@tamuri (Collaborator) commented May 17, 2021

Here's one bug, need to fix this:

birth_weight = np.random.normal(loc=params['mean_birth_weights'][mean_birth_weight_list_location],

should be

birth_weight = self.rng.normal(loc=...

@tamuri (Collaborator) commented May 17, 2021

Great work, btw!

@matt-graham (Collaborator, Author)

Here's one bug, need to fix this:

birth_weight = np.random.normal(loc=params['mean_birth_weights'][mean_birth_weight_list_location],

should be

birth_weight = self.rng.normal(loc=...

Thanks - despite looking at that bit of code on and off all day I managed to miss that it was using np.random rather than the seeded RandomState 😅. I will rerun the before and after comparison of the output tomorrow to verify that this removes the discrepancy, but given that birth_weight is used in calculating all of the columns in which the values differed, this seems likely to be the cause!

@matt-graham (Collaborator, Author)

I've now merged in @joehcollins's fix for #274. I have also added a check that the intercept passed to the model initializer is finite, which resolves #262. The changes to the eval call also resolve #264. The final population dataframe and RNG states for simulations of population 10 000 for 2 years with the latest commit here (72c0bcd) and the tip of @joehcollins's jcollins_rng_fix_may2021 branch (6c9b2bd) now match exactly. @tamuri shall I do longer runs to check that the outputs match, or is what I've done so far sufficient?

@tamuri (Collaborator) commented May 18, 2021

Thanks, Matt - that's great. You don't have to do longer runs. I have to check why CI isn't running on your PR - we might have disabled it for fork PRs. Will check, review this PR and merge.

@matt-graham (Collaborator, Author)

I have submitted a new PR #275 from a duplicate, in the main repository, of the fork branch this PR was based on, to allow GitHub Actions checks to run. I am therefore closing this PR now to avoid confusion.
