Remove the need to annotate supervised models with a fitresult type #93

Closed
ablaom opened this issue Feb 26, 2019 · 2 comments

ablaom commented Feb 26, 2019

Recall that presently one is required to declare the type of the fitresult returned by a supervised model's fit method in the model struct declaration, as in:

const FitResultType{F} = Tuple{Matrix{F},Vector{F}}
mutable struct KNNRegressor{F} <: Deterministic{FitResultType{F}}
    ...
end

One is allowed to use Any as the fitresult type, but for efficient ensembling of your model the type should be concrete. This is a pain point because the concrete type can be quite complicated. It may also depend on the data and/or hyperparameters in ways that are difficult to anticipate - DecisionTree.jl is a case in point - although this is arguably a design flaw at the level of the algorithm being wrapped.
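
To see how a fitresult type can depend on the data, here is a toy illustration (a stand-in for the phenomenon, not DecisionTree.jl itself):

# Toy fit whose return type tracks the eltype of its input data,
# mimicking the DecisionTree.jl situation described above:
toyfit(x) = (coefs = 2 .* x,)

typeof(toyfit([1, 2, 3]))    # a NamedTuple over Vector{Int64}
typeof(toyfit([1.0, 2.0]))   # a NamedTuple over Vector{Float64}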

There may be other reasons for declaring this type, but I think ensembling is the main one. However, on reviewing the ensembling code (a legacy of Koala), I don't see why the type could not be inferred by the compiler if this code is refactored appropriately, and we could then simplify the API.
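
To sketch what "inferred by the compiler" could look like (hypothetical stand-ins here, not the actual MLJ internals): instead of pre-allocating a vector typed by the declared fitresult type, build the collection of fitresults with a comprehension and let Julia infer a concrete element type on its own:

# Hypothetical stand-in for fitting one atomic model; note that its
# return type is never declared anywhere.
train_one(x::Vector{Float64}) = (mean = sum(x)/length(x), n = length(x))

bags = [rand(100) for _ in 1:500]

# The comprehension lets the compiler infer a concrete element type
# for the fitresults, keeping downstream ensembling code type-stable:
fitresults = [train_one(bag) for bag in bags]
@assert isconcretetype(eltype(fitresults))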

Investigating this is not currently a priority for me, but I am optimistically opening this issue anyway. For now I think model API implementers should feel free to declare their fitresult types as Any.
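
For example, following the declaration pattern above (with a placeholder supertype, so the snippet stands alone):

# Placeholder standing in for the Deterministic{R} of the actual API,
# defined here only to make the example self-contained:
abstract type Deterministic{R} end

# Declaring the fitresult type as Any sidesteps spelling out a
# complicated concrete type:
mutable struct MyKNNRegressor <: Deterministic{Any}
    K::Int
end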

ablaom commented May 1, 2019

Progress report:

I have benchmarked 9 atomic models in ensembles of size 500, run both on the existing code and on smarter alternative code that does not need the fitresult types (see below). The only statistically significant differences detected are, from a practical point of view, slight. The left and right columns give the endpoints of 95% confidence intervals for the run times.

So I will be proceeding with the refactoring of MLJBase, MLJ and MLJModels to eliminate the pesky fitresult type requirement. This will simplify the model interface and eliminate some confusion. This refactoring does not affect regular MLJ users, apart from advanced users building learning networks, and even then only in a minuscule way. I hope to have the changes ready in 2-4 days.
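
Concretely, after the refactoring a declaration like the one in the original post should reduce to something like this (a sketch; the final supertype names may differ slightly):

# No fitresult type parameter on the supertype, and no model type
# parameter needed just to carry it (placeholder supertype again):
abstract type Deterministic end

mutable struct KNNRegressor <: Deterministic
    K::Int
end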

I am sorry I did not get this right from the beginning. This represents some of the earliest code I wrote for Koala.

Run on alternative code (no use of the atomic model's fitresult type):

│ Row │ model                         │ left      │ right     │ memory    │ allocs │
│     │ Union{Missing, String}        │ Float64⍰  │ Float64⍰  │ Int64⍰    │ Int64⍰ │
├─────┼───────────────────────────────┼───────────┼───────────┼───────────┼────────┤
│ 1   │ ConstantClassifier @ 1…95     │ 2.05649e7 │ 2.09617e7 │ 9334208   │ 56142  │
│ 2   │ SVMClassifier @ 7…15          │ 5.60209e8 │ 5.8366e8  │ 26074448  │ 296399 │
│ 3   │ GPClassifier @ 6…49           │ 6.46297e8 │ 7.47014e8 │ 358806368 │ 75646  │
│ 4   │ DecisionTreeClassifier @ 1…62 │ 5.94981e7 │ 6.07129e7 │ 16535376  │ 103086 │
│ 5   │ OLSRegressor @ 5…03           │ 1.92026e8 │ 2.29164e8 │ 121889568 │ 73646  │
│ 6   │ RidgeRegressor @ 1…21         │ 9.29217e7 │ 9.48835e7 │ 78529600  │ 148147 │
│ 7   │ ConstantRegressor @ 1…62      │ 2.71332e7 │ 2.77541e7 │ 29765552  │ 49146  │
│ 8   │ KNNRegressor @ 1…05           │ 7.81976e7 │ 1.03665e8 │ 49393600  │ 52646  │
│ 9   │ DecisionTreeRegressor @ 1…49  │ 5.36255e8 │ 5.88652e8 │ 80310016  │ 255238 │

Run on master (using the declared fitresult type):

julia> master = CSV.read("master.csv")
9×5 DataFrames.DataFrame
│ Row │ model                         │ left      │ right     │ memory    │ allocs │
│     │ Union{Missing, String}        │ Float64⍰  │ Float64⍰  │ Int64⍰    │ Int64⍰ │
├─────┼───────────────────────────────┼───────────┼───────────┼───────────┼────────┤
│ 1   │ ConstantClassifier @ 1…84     │ 2.00471e7 │ 2.0614e7  │ 9338224   │ 56138  │
│ 2   │ SVMClassifier @ 2…15          │ 5.0585e8  │ 5.20881e8 │ 25998624  │ 291405 │
│ 3   │ GPClassifier @ 8…22           │ 6.35922e8 │ 7.409e8   │ 358810384 │ 75642  │
│ 4   │ DecisionTreeClassifier @ 1…37 │ 5.87301e7 │ 6.0047e7  │ 16539120  │ 103075 │
│ 5   │ OLSRegressor @ 4…98           │ 2.00281e8 │ 2.28798e8 │ 121893584 │ 73642  │
│ 6   │ RidgeRegressor @ 1…24         │ 9.58506e7 │ 1.04885e8 │ 78533616  │ 148143 │
│ 7   │ ConstantRegressor @ 8…27      │ 2.69583e7 │ 2.75019e7 │ 29769568  │ 49142  │
│ 8   │ KNNRegressor @ 6…08           │ 1.01598e8 │ 1.33261e8 │ 68845616  │ 54142  │
│ 9   │ DecisionTreeRegressor @ 2…85  │ 5.37627e8 │ 5.78552e8 │ 80313952  │ 255233 │

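For the record, confidence intervals of this kind can be obtained along the following lines (a sketch using BenchmarkTools and a normal approximation; this is not the actual script behind the tables above):

using BenchmarkTools, Statistics

# Hypothetical stand-in for training one ensemble of 500 atoms:
train_ensemble() = [sum(rand(10_000)) for _ in 1:500]

t = @benchmark train_ensemble()

# t.times holds the raw sample times in nanoseconds; a 95% confidence
# interval for the mean run time is then:
m, s, n = mean(t.times), std(t.times), length(t.times)
left, right = m - 1.96 * s / sqrt(n), m + 1.96 * s / sqrt(n)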

ablaom commented May 9, 2019

Resolved on the master branches of MLJBase, MLJModels and MLJ. Live in the upcoming 0.2.0 releases of each.

ablaom closed this as completed May 9, 2019