Optimizations of `LinearModel.predict` implementation #267

Conversation
Looks good! As discussed:
I've pushed a commit which changes to always using the …

With regards to point (2), for an initial test with a simulation with a population size of 10 000 over two years (starting with a short run as an initial check), I get that … where … The final 400 rows of the two data frames do not exactly match however, specifically in the values in the columns …

I initially thought that the discrepancies in these values before and after the changes made in this PR were probably due to the random number generator states differing at the point of the births which generate the rows beyond the initial population size. However, after rerunning the simulations with the final random number generator states saved, the states of the simulation and module level RNGs match exactly at the end of the simulation runs using the code before and after the changes in this PR. It's still possible that the states are out of sync at some intermediate point in the simulation before getting back in sync by the end, however that feels quite implausible. Hence it seems that something else may be at play. The exact floating point output of …
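For illustration, a rough sketch of the kind of comparison described above, with hypothetical variable names (the actual run outputs and comparison code are not shown in this thread): locating mismatching values between the final population dataframes of the two runs, and checking that two saved `RandomState` states match.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the final population dataframes from runs
# before and after the changes in this PR.
df_before = pd.DataFrame({"value": [1.0, 2.0, 3.0]})
df_after = pd.DataFrame({"value": [1.0, 2.0000001, 3.0]})

# Show only the cells whose values differ between the two frames.
print(df_before.compare(df_after))

# Check two saved (legacy MT19937) RandomState states match exactly.
state_before = np.random.RandomState(0).get_state()
state_after = np.random.RandomState(0).get_state()
states_match = (
    state_before[0] == state_after[0]                     # bit generator name
    and np.array_equal(state_before[1], state_after[1])   # 624-word key
    and state_before[2:] == state_after[2:]               # pos, has_gauss, cached value
)
print(states_match)
```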
Check the RandomState object (…)
Here's one bug, need to fix this:

TLOmodel/src/tlo/methods/newborn_outcomes.py, line 471 in 82a0c3b

should be …
Great work, btw!
Thanks - despite looking at that bit of code on and off all day, I managed to miss that it was using …
… mmg/lm-predict-opt Merging @joehcollins fix for #274 which was identified while reviewing #267
Now merged in @joehcollins' fix for #274. I have also added a check that the intercept passed to the model initializer is finite, which resolves #262. The changes to the …
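As a rough illustration only (assumed constructor signature, not the actual diff), the finiteness check described above might look something like:

```python
import math


class LinearModel:
    def __init__(self, lm_type, intercept, *predictors):
        # Fail fast on nan/inf intercepts rather than letting them silently
        # propagate through every subsequent prediction (see #262).
        if not (isinstance(intercept, (int, float)) and math.isfinite(intercept)):
            raise ValueError(f"intercept must be a finite number, got {intercept!r}")
        self.lm_type = lm_type
        self.intercept = intercept
        self.predictors = predictors
```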
Thanks, Matt - that's great. You don't have to do longer runs. Have to check why CI isn't running on your PR. We might have disabled it for fork PRs. Will check, review this PR and merge.
I have submitted a new PR #275, from a duplicate in the main repository of the fork branch this PR was based on, to allow GitHub Actions checks to run, and am therefore closing this PR now to avoid confusion.
Related to #207. Refactors the implementation of the `LinearModel.predict` method to remove some of the inefficiencies in the current implementation. Below are the SnakeViz outputs for a profiling run (using the updated version of the `scale_run.py` script in this PR) with a population size of 20 000 and a simulation period of 2 years using the current `HEAD` of master at the time of writing (cae6fb7). The first shows the overall breakdown and the second the subset of the call time spent in the `LinearModel.predict` function.

Currently `LinearModel.predict` calls `Predictor.predict` for each of its predictors (`lm.py:86(predict)` block in the second plot above), with each (non-callback) predictor calling the `eval` method on the input dataframe for each condition in the predictor. Each `eval` call involves populating a dictionary of 'column resolvers' for each of the columns in the dataframe (`generic.py:526(_get_cleaned_column_resolvers)` block in the second plot above), with these then used to map from names to column values in the parsed expression. For dataframes with many columns, constructing this dictionary can be a substantial overhead in each `eval` call - in the profiling run 83% (236s / 284s) of the `eval` call time was spent constructing the column resolvers. Another 6% (16.3s / 284s) of the time is spent constructing a resolver for the dataframe index, which can be accessed via the name `index` in the expression passed to `eval`; however, this feature is not used in any of the current linear model implementations. The remaining 11% (31.5s / 284s) is spent on actually evaluating the expression, i.e. parsing it (8%, 21.4s / 284s) and executing it with the chosen engine (3%, 8.29s / 284s). This low percentage reflects the fact that the expressions for most individual predictor conditions are relatively simple.
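To make the pattern concrete, here is a simplified sketch (not the TLOmodel source) of the per-condition evaluation described above, in which every condition triggers its own `DataFrame.eval` call and hence its own resolver construction:

```python
import pandas as pd

# Toy population dataframe; the real one has hundreds of columns, all of
# which get wrapped into 'column resolvers' on every eval call.
df = pd.DataFrame({
    "age_years": [12, 34, 70],
    "sex": ["F", "M", "F"],
    "li_urban": [True, False, True],
})

# Hypothetical conditions (one per predictor for simplicity) with
# multiplicative effect values.
conditions_and_values = [
    ("age_years < 15", 0.5),
    ("sex == 'F'", 1.2),
    ("li_urban", 0.8),
]

result = pd.Series(1.0, index=df.index)
for condition, value in conditions_and_values:
    # Each call to DataFrame.eval rebuilds the resolver dictionary for
    # all columns, even though the condition references only one of them.
    mask = df.eval(condition)
    result[mask] *= value
print(result)
```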
Of the remaining time spent in `LinearModel.predict`, most is spent in `pandas.Series.__setitem__` in `Predictor.predict` (46.2s) and `pandas.DataFrame.__setitem__` in `LinearModel.predict` (24.4s), corresponding respectively (I believe) to the lines

TLOmodel/src/tlo/lm.py, lines 121 to 122 in cae6fb7

and

TLOmodel/src/tlo/lm.py, lines 232 to 233 in cae6fb7

i.e. assigning the values for the conditions matched so far in the predictor to the series recording the output, and writing these output series back to the per-predictor results dataframe.
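For context, a rough sketch (assumed shapes and names, not the referenced lm.py lines) of the two assignment patterns this remaining time is attributed to:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"age_years": rng.integers(0, 80, size=20_000)})

# Inside a predictor: masked Series assignment for each matched condition
# (pandas.Series.__setitem__).
output = pd.Series(1.0, index=df.index)
mask = df["age_years"] < 15
output[mask] = 0.5

# Inside the model: writing each predictor's output Series back into a
# per-predictor results dataframe (pandas.DataFrame.__setitem__).
res_by_predictor = pd.DataFrame(index=df.index)
res_by_predictor["age_years"] = output
```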
As mentioned in #207, the code also currently creates a new dataframe when calling `df.assign` in

TLOmodel/src/tlo/lm.py, lines 210 to 215 in cae6fb7

when there are external variables specified in the model, though in practice this seems to be only a small overhead (0.012s cumulatively spent in `DataFrame.assign` in the profiling run).
This PR proposes several related changes to try to reduce some of the overheads described above.

Rather than calling `DataFrame.eval` for each condition of each predictor in a model, a single expression string is built up corresponding to the model output for all the (non-callback) model predictors, and this expression string is then evaluated in a single `eval` call. This means much more of the `eval` call time is spent in parsing and evaluating the passed expressions rather than in repeatedly building the resolver dictionaries. Further, the expression string can be built once upon initialisation of the model and then reused for all `predict` calls. I had also hoped this could give some efficiency gain when `numexpr` is available, as the Pandas documentation on using `eval` to improve performance suggests there can be substantial gains when using `eval` with the `numexpr` engine on dataframes with more than 10 000 rows, and that …

In practice the profiling results (see below) suggest that there isn't currently any gain from using `numexpr`. I believe this is largely due to the expression strings being unable to be evaluated with `numexpr` in many cases, with the code currently falling back to the `python` engine, due to the use of non-`numexpr`-compatible syntax in some of the predictors in many models, such as the use of column methods like `between`. As the expression ends up getting parsed twice in cases where `numexpr` fails to evaluate it, the overhead from this outweighs any gains in performance when `numexpr` is successfully used. While there is therefore currently no advantage to using `numexpr`, the refactoring in this PR would potentially also make future gains in performance possible if more of the linear models parsed to expression strings with `numexpr`-compatible syntax, either via more intelligent construction of the parsed string or by defining some additional helper methods on the predictors corresponding to the most common non-`numexpr`-compatible syntax.
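A minimal sketch of the idea, with an assumed toy representation of the predictors (this is not the actual PR code or the TLOmodel `Predictor` API): the combined expression is built once and then evaluated with a single `eval` call per `predict`.

```python
import pandas as pd

# Toy multiplicative model: intercept plus, per predictor, an ordered mapping
# of condition -> effect value (first matching condition wins, default 1).
intercept = 0.1
predictors = {
    "age": {"age_years < 15": 0.5, "age_years >= 65": 1.5},
    "residence": {"li_urban": 0.8},
}


def build_expression(intercept, predictors):
    """Build a single eval-able expression string for the whole model."""
    terms = [str(intercept)]
    for conditions in predictors.values():
        # Encode first-match-wins semantics arithmetically so each predictor
        # collapses to one sub-expression.
        term = "1"
        for condition, value in reversed(list(conditions.items())):
            term = f"({value} * ({condition}) + ({term}) * (~({condition})))"
        terms.append(term)
    return " * ".join(terms)


# Built once at model initialisation and reused for every predict call.
expression = build_expression(intercept, predictors)

df = pd.DataFrame({"age_years": [10, 40, 70], "li_urban": [True, False, True]})
# One eval call per predict; the python engine is forced here for simplicity,
# whereas the PR lets pandas try numexpr first where the syntax allows it.
result = df.eval(expression, engine="python")
print(result)
```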
The column resolvers are manually constructed in each `LinearModel.predict` call, with only the columns corresponding to names used in the predictors iterated over from the dataframe, rather than iterating over all columns. The external variables are also added directly to the column resolvers rather than being assigned to a new dataframe, although as noted above this does not seem to be a significant overhead in practice anyway. Currently no index resolvers are constructed, as the `index` name is not used in any downstream code, but this would be easy to add if required.
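An illustrative sketch (hypothetical names, not the PR diff) of manually constructed resolvers passed to `pandas.eval`, with external variables injected directly into the same mapping:

```python
import pandas as pd

df = pd.DataFrame({"age_years": [10, 40, 70], "li_urban": [True, False, True]})

# Only the columns actually referenced by the model's predictors, known at
# model construction time, are exposed to eval.
columns_used = ["age_years"]
# External (per-call) variables go straight into the same mapping instead of
# being added to a copy of the dataframe with DataFrame.assign.
external_variables = {"year": 2015}

resolvers = {name: df[name] for name in columns_used}
resolvers.update(external_variables)

result = pd.eval(
    "(age_years < 15) * 0.5 + (age_years >= 15) * 1.0 + (year - 2010) * 0.01",
    resolvers=(resolvers,),
    engine="python",
)
print(result)
```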
Some additional checks are also added to the tests in `tests/test_lm.py` to cover some additional edge cases encountered while implementing these changes (specifically, ensuring that models with only an intercept and no predictors output correct predictions, and that valid outputs are given for dataframes with columns of categorical datatype with integer categories, as Pandas' treatment of such columns in `eval` is quite brittle). Some of the checks involving comparisons of floating point values have also been relaxed to allow for differences due to accumulated floating point error when e.g. doing reduction operations in a different order.
With the changes in this PR, the SnakeViz outputs for an equivalent profiling run (2 years / 20 000 population) as shown above, without `numexpr` installed in the running environment, are as follows. The time spent in `LinearModel.predict` is 27% of previously (93.2s / 348s), with the overall total runtime 88% of previously (1927s / 2181s). Looking at the breakdown of the time spent in `LinearModel.predict` in the second plot, we can see that 93% (87.1s / 93.2s) is spent in parsing and evaluating the expression, with the overhead from constructing the resolvers etc. now minimal.
Equivalent SnakeViz outputs for a run with `numexpr` installed in the running environment are as follows. There is still a net gain in performance, albeit smaller than without `numexpr`. We see that significantly more time is spent in parsing the expressions when calling `eval` (78.6s vs 58.6s), potentially due to the duplication of parsing twice for models for which initially evaluating with `numexpr` fails.