GSoC_2019_project_efficient_ml
... continuing from 2017 GSoC
We are continuing the highly popular project of previous years: the aim is to improve our implementations of fundamental ML algorithms. As this year's focus is on user experience with Shogun, we focus on finding the bad guys. Who are the bad guys? They are implementations of algorithms in Shogun that are embarrassingly bad in at least one of: runtime, memory efficiency, code style, API, documentation ... we don't want to embarrass ourselves ;)
While we don't need Shogun to be the fastest/best/prettiest library in all tasks, it at least should not suck. This project is about identifying and fixing all those "bad guys".
- Heiko (github: karlnapf, IRC: HeikoS)
- Sergey (github: lisitsyn, IRC: lisitsyn)
- Ryan Curtin from mlpack, IRC: rcurtin
- Marcus Edel from mlpack, IRC: zoq
Medium to difficult, you need to dig into existing code and you will need:
- ML Algorithms in C++
- Re-factoring existing code / design patterns
- Knowledge of basic ML
- Basic Linear Algebra, Shogun's `linalg` framework
- Experience with other ML toolkits (preferably Python, such as scikit-learn, or C++ such as mlpack)
- Desirable: Experience with the benchmarking system
- Desirable: The ability to make algorithms more cache friendly
Here are some examples of what topics should be covered.
Have a look at the benchmark comparisons of Shogun with other libraries in mlpack's benchmarking framework. You will notice that sometimes Shogun does quite well, for example for KMeans:
dataset | mlpy | scikit | shogun | weka | mlpack |
---|---|---|---|---|---|
corel-histogram | 3.59s | 0.73s | 1.11s | 19.43s | 1.92s |
mnist | 119.83s | 46.13s | 16.02s | 1558.07s | 61.35s |
On the other hand, there are situations that are less than optimal, like for linear regression, where Shogun fails.
dataset | mlpy | scikit | shogun | weka | mlpack |
---|---|---|---|---|---|
arcene | failure | 0.24s | failure | 3.16s | 0.42s |
cosExp | 0.13s | 0.08s | failure | 17.42s | 0.13s |
Another one is linear ridge regression, where Shogun is extremely slow:
dataset | scikit | shogun |
---|---|---|
webpage | 1.94s | >9000s |
Again, we don't need Shogun to be the fastest candidate everywhere. We just don't want it to be the slowest by far.
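A runtime comparison of the kind shown in the tables above can be scripted in a few lines. The following is a minimal sketch using only the Python standard library. The names `best_time`, `compare`, `slow_sum_squares` and `fast_sum_squares` are hypothetical and introduced purely for illustration; the two toy functions stand in for a "before" and an "after" implementation. mlpack's benchmarking framework has its own, far more complete tooling.

```python
import time


def best_time(fn, *args, repeats=5):
    """Return the best wall-clock time (seconds) over `repeats` calls of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def slow_sum_squares(values):
    # Hypothetical "before" implementation: copies the list on every step.
    squares = []
    for v in values:
        squares = squares + [v * v]  # quadratic-time list copying
    return sum(squares)


def fast_sum_squares(values):
    # Hypothetical "after" implementation: single pass, no temporaries.
    return sum(v * v for v in values)


def compare(before, after, *args, repeats=5):
    """Check that both implementations agree, then report their best runtimes."""
    if before(*args) != after(*args):
        raise ValueError("implementations disagree; fix correctness first")
    t_before = best_time(before, *args, repeats=repeats)
    t_after = best_time(after, *args, repeats=repeats)
    speedup = t_before / t_after if t_after > 0 else float("inf")
    return {"before_s": t_before, "after_s": t_after, "speedup": speedup}


if __name__ == "__main__":
    data = list(range(2000))
    print(compare(slow_sum_squares, fast_sum_squares, data))
```

Taking the best of several runs reduces noise from other processes, and checking that both versions return the same result guards against "speedups" that come from computing the wrong thing.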
Example: have a look at GMM. It has 3 train methods, awkward methods like `get_nth_mean`, multiple methods to apply it (`::cluster`, `::get_likelihood_example`), etc.
A first step would be to rename the methods to something that looks nice, or to remove them (we have tags so no need for getters/setters anymore).
Next, GMM is nothing else but an unsupervised learning algorithm, so it should support the standard learning interface (`fit`, `predict`) and not offer its own methods.
Next, GMM is also a distribution that can be sampled from, so it should be possible to turn it into an API that supports sampling.
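To make the target API shape concrete, here is a toy sketch of a mixture model that exposes only `fit`, `predict` and `sample`. It is written in Python for brevity and handles 1-D data only. The class `GaussianMixture1D` and all of its parameters are hypothetical and are not Shogun's GMM class or its planned interface; the EM updates are the textbook ones.

```python
import math
import random


class GaussianMixture1D:
    """Toy 1-D Gaussian mixture with a uniform fit/predict/sample interface.

    Hypothetical illustration only; this is not Shogun's GMM class.
    """

    def __init__(self, num_components, num_iterations=50, seed=0):
        self.num_components = num_components
        self.num_iterations = num_iterations
        self._rng = random.Random(seed)
        self.weights, self.means, self.variances = [], [], []

    @staticmethod
    def _pdf(x, mean, var):
        return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

    def _responsibilities(self, x):
        joint = [w * self._pdf(x, m, v)
                 for w, m, v in zip(self.weights, self.means, self.variances)]
        total = sum(joint) or 1e-300  # guard against underflow to zero
        return [j / total for j in joint]

    def fit(self, data):
        """Estimate the parameters with expectation-maximisation (EM)."""
        k, n = self.num_components, len(data)
        ordered = sorted(data)
        # Initialise means at evenly spaced quantiles, with a shared variance.
        self.means = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
        mean_all = sum(data) / n
        var_all = sum((x - mean_all) ** 2 for x in data) / n
        self.variances = [max(var_all, 1e-6)] * k
        self.weights = [1.0 / k] * k
        for _ in range(self.num_iterations):
            resp = [self._responsibilities(x) for x in data]  # E-step
            for j in range(k):  # M-step
                nj = sum(r[j] for r in resp) or 1e-300
                self.weights[j] = nj / n
                self.means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
                var = sum(r[j] * (x - self.means[j]) ** 2
                          for r, x in zip(resp, data)) / nj
                self.variances[j] = max(var, 1e-6)  # avoid collapsing components
        return self

    def predict(self, data):
        """Return the index of the most likely component for each point."""
        labels = []
        for x in data:
            r = self._responsibilities(x)
            labels.append(max(range(self.num_components), key=r.__getitem__))
        return labels

    def sample(self, num_samples):
        """Pick a component according to its weight, then sample its Gaussian."""
        comps = self._rng.choices(range(self.num_components),
                                  weights=self.weights, k=num_samples)
        return [self._rng.gauss(self.means[c], math.sqrt(self.variances[c]))
                for c in comps]


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(-5, 1) for _ in range(200)] + [rng.gauss(5, 1) for _ in range(200)]
    model = GaussianMixture1D(num_components=2).fit(data)
    print(model.means, model.predict([-5.0, 5.0]), model.sample(3))
```

The point is the surface, not the numerics: one way to train, one way to apply, one way to sample, and no per-parameter getters.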
We actually wrote some API desiderata for the user experience project, which overlaps with this project in terms of API. Think: you identify bad API and how it should look instead, the user experience project person implements the basics that make your changes possible, and you change the algorithm.
Some bad examples:
- Example of one sentence docs
- Example of no documentation at all
- Example of bad documentation -- no description of what happens or how expensive it is.
- Example of bad documentation -- talks about using the features in ERM, but this class is just about feature embeddings, so it should talk about the embedding, its computational costs, and what one can do with it: pass it to linear algorithms (like a linear SVM).
You get the point...
- Increase coverage of Shogun in the benchmark framework. Ideally all algorithms in the framework should be populated with Shogun implementations
- Make a priority list of algorithms where Shogun doesn't do well: runtime & memory
- Make a list of badly documented or undocumented algorithm classes (missing `@brief`, one sentence docs)
- Make a list of algorithms with awkward API
- Take a single instance and work on it until things are better.
- Whenever you touch the internals, make sure to also polish: `linalg` usage, API, class design
- Work on a one-by-one basis
- Whenever you improve something, make sure to provide a "before-after" comparison.
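The list of badly documented classes could be bootstrapped with a small script. The sketch below is an assumption about how one might do it, not an existing Shogun tool: it scans C++ header text for class declarations and checks whether a Doxygen comment with `@brief` directly precedes each one. The function `audit_header`, the regular expressions and the `SAMPLE` header are all made up for illustration, and the heuristics are deliberately crude (for example, a `template <...>` line between the comment and the class is reported as undocumented).

```python
import re

# A class declaration at the start of a line. Forward declarations such as
# "class CFoo;" are skipped by requiring ':', '{' or end-of-line after the name.
CLASS_RE = re.compile(r"^\s*class\s+(\w+)\s*(?::|\{|$)", re.MULTILINE)
# A Doxygen block comment: /** ... */
DOC_RE = re.compile(r"/\*\*(.*?)\*/", re.DOTALL)


def audit_header(source):
    """Return {class_name: verdict} for each class declared in `source`."""
    report = {}
    for match in CLASS_RE.finditer(source):
        name = match.group(1)
        before = source[:match.start()]
        docs = list(DOC_RE.finditer(before))
        # Only accept a comment that directly precedes the declaration.
        if not docs or before[docs[-1].end():].strip():
            report[name] = "undocumented"
            continue
        text = docs[-1].group(1)
        if "@brief" not in text:
            report[name] = "missing @brief"
            continue
        # Drop the leading '*' decoration, then count sentences crudely.
        body = " ".join(re.sub(r"^\s*\*\s?", "", line) for line in text.splitlines())
        sentences = [s for s in re.split(r"[.!?](?:\s+|$)", body) if s.strip()]
        report[name] = "one-sentence docs" if len(sentences) <= 1 else "ok"
    return report


# Made-up header content used only to exercise the script.
SAMPLE = """
/** @brief Gaussian mixture model. */
class CGMM : public CDistribution
{
};

class CUndocumented
{
};

/** Some kernel, see the paper. */
class CNoBrief : public CKernel
{
};

/**
 * @brief Ridge regression solved via a Cholesky factorisation.
 * Training costs O(n d^2 + d^3) time for n samples of dimension d.
 */
class CGoodDocs : public CMachine
{
};

class CForward;
"""

if __name__ == "__main__":
    for cls, verdict in audit_header(SAMPLE).items():
        print(f"{cls}: {verdict}")
```

The output is only a starting point for the priority list; whether a description actually explains what happens and how expensive it is still needs a human reader.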
This project offers the chance to learn about many fundamental ML algorithms from a practical perspective, with a focus on usability and efficiency. As we want to start with important algorithms first, it is likely that many people will use (and appreciate) code that you wrote.