Add Bayesian Additive Regression Trees (BARTs) #4183
Conversation
Co-authored-by: aloctavodia <aloctavodia@gmail.com> Co-authored-by: jmloyola <jmloyola@outlook.com>
pymc3/step_methods/pgbart.py
Outdated
vars : list
    List of variables for sampler
num_particles : int
    Number of particles for the SMC sampler. Defaults to 10
SMC -> PGBART, same below
I tried to follow the nomenclature in the papers; one step of PGBART is the "conditional-SMC" method. But I see how this can be confusing.
pymc3/step_methods/pgbart.py
Outdated
return particles


class Particle:
Suggested change:
- class Particle:
+ class ParticleTree:
I will do a full review next week, but just want to say congrats and looking forward to trying this out!
Codecov Report
@@            Coverage Diff             @@
##           master    #4183      +/-   ##
==========================================
+ Coverage   88.91%   88.93%   +0.02%
==========================================
  Files          89       92       +3
  Lines       14429    14788     +359
==========================================
+ Hits        12829    13152     +323
- Misses       1600     1636      +36
==========================================
Very impactful work, thank you for doing this!
Don't want to block this, so feel free to merge when the tests pass.
I will test it out on master and file bugs if problems arise.
This adds BARTs to PyMC3. At the API level this looks (almost) like a new distribution (more on this below). Additionally, this adds a new sampler specifically designed for BART, the PGBART sampler; this is necessary because trees are a very particular kind of discrete distribution (we can also see them as stepwise functions).
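To make the API concrete, here is a minimal, hypothetical sketch (the data and names are illustrative, not taken from this PR; it assumes a `pm.BART(name, X, Y, m=...)` signature and that `pm.sample` assigns the PGBART step to the BART variable automatically):

```python
import numpy as np
import pymc3 as pm

# Toy 1D regression data (illustrative only)
X = np.random.uniform(0, 10, size=(100, 1))
Y = np.sin(X[:, 0]) + np.random.normal(0, 0.1, size=100)

with pm.Model() as model:
    sigma = pm.HalfNormal("sigma", 1.0)
    # BART used (almost) like any other distribution: the unknown f(X)
    # is modeled as a sum of m trees (signature assumed, see above)
    mu = pm.BART("mu", X, Y, m=50)
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=Y)
    # assumed: pm.sample picks the PGBART step for `mu` automatically
    trace = pm.sample(1000, chains=2)
```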
The general idea of BARTs is that given a problem of the form `y = f(X)`, we can approximate the unknown function `f` as a sum of `m` trees. As trees can easily overfit, BARTs put priors over the trees so that each tree is only capable of explaining a little bit of the data (for example, trees tend to be shallow), and thus we must sum many trees to get a reasonably good approximation.

A 1D example:
The black line is the mean of μ and the band is the HDI of μ. As you can see, the mean is not a smooth curve because trees are discrete. Notice that this mean is a sum of 50 trees over 2000 posterior draws (2 chains of 1000 draws each).
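For intuition about the sum-of-trees representation itself (this sketch is deliberately non-Bayesian and says nothing about the PGBART sampler), here is a stagewise fit of `m` shallow scikit-learn trees to the residuals; each tree explains only a little bit of the data, and their sum is a stepwise approximation of `f`:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(200, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

m = 50
residual = y.copy()
f_hat = np.zeros_like(y)
for _ in range(m):
    # shallow trees, analogous to the BART prior favoring shallow trees
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)
    pred = tree.predict(X)
    f_hat += pred      # the approximation is the sum of the trees
    residual -= pred   # so each tree only explains part of the data

# f_hat is a stepwise (non-smooth) approximation of sin(x), just like
# the non-smooth posterior mean in the figure above
```

Unlike this greedy sketch, BART puts priors over the trees and averages over tree structures via MCMC, which is what produces the HDI band in the figure.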
This work is the continuation of what @jmloyola did for GSoC. The main differences are that I reduced most of the tree code to its essential parts, I tried to speed things up (there is probably still room for improvement), and mainly I focused on making BART work inside a probabilistic programming language (I mention this because there is a family of BART methods that are generally designed with a specific likelihood in mind and thus rely on conjugacy). My goal for the BART implementation in PyMC (this will need more PRs) is for BART to become as flexible as any other distribution, so it can be combined with other distributions to create arbitrary models. At the moment its parameters m and alpha must be floats, not distributions; the main reason is that this is generally the case. There are some reports in the literature saying that putting priors on top of those parameters does not work very well computationally, but this is something I would like to explore.
Some missing features I would like to work on in future PRs: variable selection methods, better tests, documentation, and storing info that could be used for diagnostics. I also want to do some research to better understand how BART behaves for real/complex datasets and ways to better select its parameters (LOO, CV, priors...).