Add Bayesian Additive Regression Trees (BARTs) #4183
Conversation
Co-authored-by: aloctavodia <aloctavodia@gmail.com> Co-authored-by: jmloyola <jmloyola@outlook.com>
pymc3/step_methods/pgbart.py
Outdated
vars : list
    List of variables for sampler
num_particles : int
    Number of particles for the SMC sampler. Defaults to 10
SMC -> PGBART, same below
I tried to follow the nomenclature in the papers; one step of PGBART is the "conditional-SMC" method. But I see how this can be confusing.
pymc3/step_methods/pgbart.py
Outdated
return particles


class Particle:
Suggested change:
- class Particle:
+ class ParticleTree:
I will do a full review next week, but just want to say congrats and looking forward to trying this out!
Codecov Report
@@            Coverage Diff             @@
##           master    #4183      +/-   ##
==========================================
+ Coverage   88.91%   88.93%   +0.02%
==========================================
  Files          89       92       +3
  Lines       14429    14788     +359
==========================================
+ Hits        12829    13152     +323
- Misses       1600     1636      +36
==========================================
Very impactful work, thank you for doing this!
Don't want to block this, so feel free to merge when the tests pass.
I will test it out on master and file bugs if problems arise.
This adds BARTs to PyMC3. At the API level this looks (almost) like a new distribution (more on this below). Additionally, this adds a new sampler specifically designed for BART, the PGBART sampler; this is necessary because trees are a very particular kind of discrete distribution (we can also see them as stepwise functions).
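To make the API concrete, here is a minimal, hypothetical sketch (the data and names are illustrative, not taken from this PR; it assumes a `pm.BART(name, X, Y, m=...)` signature and that `pm.sample` assigns the PGBART step to the BART variable automatically):

```python
import numpy as np
import pymc3 as pm

# Toy 1D regression data (illustrative only)
X = np.random.uniform(0, 10, size=(100, 1))
Y = np.sin(X[:, 0]) + np.random.normal(0, 0.1, size=100)

with pm.Model() as model:
    sigma = pm.HalfNormal("sigma", 1.0)
    # BART used (almost) like any other distribution: the unknown f(X)
    # is modeled as a sum of m trees (signature assumed, see above)
    mu = pm.BART("mu", X, Y, m=50)
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=Y)
    # assumed: pm.sample picks the PGBART step for `mu` automatically
    trace = pm.sample(1000, chains=2)
```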
The general idea of BARTs is that given a problem of the form `y = f(X)`, we can approximate the unknown function `f` as a sum of `m` trees. As trees can easily overfit, BARTs put priors over the trees so that each tree is only capable of explaining a little bit of the data (for example, trees tend to be shallow), and thus we must sum many trees to get a reasonably good approximation.

A 1D example:
The black line is the mean of μ and the band is the HDI of μ. As you can see, the mean is not a smooth curve because trees are discrete. Notice that this mean is a sum of 50 trees over 2000 posterior draws (2 chains of 1000 draws each).
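For intuition about the sum-of-trees representation itself (this sketch is deliberately non-Bayesian and says nothing about the PGBART sampler), here is a stagewise fit of `m` shallow scikit-learn trees to the residuals; each tree explains only a little bit of the data, and their sum is a stepwise approximation of `f`:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(200, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

m = 50
residual = y.copy()
f_hat = np.zeros_like(y)
for _ in range(m):
    # shallow trees, analogous to the BART prior favoring shallow trees
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)
    pred = tree.predict(X)
    f_hat += pred      # the approximation is the sum of the trees
    residual -= pred   # so each tree only explains part of the data

# f_hat is a stepwise (non-smooth) approximation of sin(x), just like
# the non-smooth posterior mean in the figure above
```

Unlike this greedy sketch, BART puts priors over the trees and averages over tree structures via MCMC, which is what produces the HDI band in the figure.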
This work is the continuation of what @jmloyola did for GSoC. The main differences are that I reduced most of the tree code to its essential parts, I tried to speed things up (there is probably still room for improvement), and mainly I focused on making BART work inside a probabilistic programming language (I mention this because there is a family of BART methods that are generally designed with a specific likelihood in mind and thus rely on conjugacy). My goal for the BART implementation in PyMC (this will need more PRs) is for BART to become as flexible as any other distribution, so it can be combined with other distributions to create arbitrary models. At the moment its parameters m and alpha must be floats, not distributions; the main reason is that this is generally the case. There are some reports in the literature saying that putting priors on top of those parameters does not work very well computationally, but this is something I would like to explore.
Some missing features I would like to work on in future PRs: variable selection methods, better tests, documentation, and storing info that could be used for diagnostics. I also want to do some research to better understand how BART behaves for real/complex datasets and ways to better select its parameters (LOO, CV, priors...).