
Journal Club Presentation


% Defying the Curse of Dimensionality: Competitive Seizure Prediction with Kaggle
% Gavin Gray
% November 28th 2014

What does a Kaggle competition look like?

The view from Github

. . .

Commits by time.

As many people here probably already know what a Kaggle competition is, I'm going to describe what one is in a roundabout way that hopefully won't bore those people.

To get an idea of what something is, instead of describing it, what if I just show you what it looks like from some different angles? Then you can get an idea of what it is yourself. These are some ways of looking at the Kaggle competition project Scott, Finlay and I were working on.

This is one of the standard graphs GitHub will produce if you ask it to.


Punch card graph of hours when we were working.

Looking at the punch card, you can see when we were making most of our commits in this project.

That might say more about us than it does about the project.

The view from the data

Graph showing example pre-ictal samples from the raw data [@aeskaggle].

The data you can see here is what we were working with. These are recordings from electrodes inside the brains of patients and dogs in two states: preictal and interictal, i.e. just before a seizure and _not_ just before a seizure.

These graphs show some of what was produced from the raw data. We ended up with a vast array of different pre-processing options, which I'll describe later in the presentation, but this should give you an idea of what they are.

Coloured by class.

This is a spectral embedding.

It forms an affinity matrix given by the specified function and applies spectral decomposition to the corresponding graph Laplacian; the resulting transformation is given by the value of the eigenvectors for each data point.
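
In code, producing an embedding like the one plotted is only a couple of lines with scikit-learn's SpectralEmbedding. This is a minimal sketch: the feature matrix is a random placeholder standing in for our features, and the rbf affinity is an assumption.

```python
# Minimal sketch of a spectral embedding; X is a random placeholder
# standing in for our feature matrix, and the rbf affinity is an assumption.
import numpy as np
from sklearn.manifold import SpectralEmbedding

X = np.random.randn(200, 50)                     # (samples, features)
embedder = SpectralEmbedding(n_components=2, affinity='rbf')
X_2d = embedder.fit_transform(X)                 # 2-D points to scatter-plot,
                                                 # coloured by class or by hour
```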

This graph illustrates why generalisation in this problem was so difficult. You can see that if our task had been to tell the difference between test and training data, it would have been much easier than telling the difference between interictal and preictal.


Coloured by hour.

The view from Kaggle

Final leaderboard results [@aeskaggle].

So we try to predict whether the samples are preictal or interictal, then supply our predictions for all of these test samples to Kaggle, and they score us on the leaderboard.

That's the competition.

We were scored on the area under the curve (AUC) of a receiver operating characteristic (ROC) curve. Basically, it's a plot of false positive rate versus true positive rate as the threshold of the classifier is varied.
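
This is the standard scikit-learn metric; here is a minimal sketch with made-up labels and scores.

```python
# Sketch of the AUC metric with made-up labels and scores
# (interictal = 0, preictal = 1).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # true classes of four test segments
y_score = [0.1, 0.4, 0.35, 0.8]  # classifier's predicted preictal probabilities
print(roc_auc_score(y_true, y_score))  # 0.75
```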

We could submit ten entries a day and receive scores on each, so you might think we could just overfit the test data by submitting a huge number of times. There's a catch: the score shown on the leaderboard during the competition is calculated from only a fraction of the test set, and only at the end is the real score, computed on all of the test data, revealed.

This means that at the end of the competition, there's a big shakeup of the scores. It worked pretty well for us, as we went up 15 places.

What can you work on?

Historical competitions

A sample from 143 completed competitions:

  • Heritage Health Prize
  • Merck Molecular Activity Challenge
  • Observing Dark Worlds
  • The Marinexplore and Cornell University Whale Detection Challenge
  • Africa Soil Property Prediction Challenge
  • CONNECTOMICS (that is the whole name)
  • Many, many corporate competitions...

I've tried to pick out a good mix of the different kinds of projects on there, but there are definitely some that are different from those I've picked.

The first one here, the Heritage Health Prize, was a competition to predict whether someone would go to hospital based on their health problems in the previous year. The prize was $500,000 (the biggest prize awarded so far).

I put the Observing Dark Worlds challenge there because Iain Murray here at Edinburgh did very well in it, narrowly coming 2nd.

The others are a mix. A large proportion of the competitions are corporate analytics: predicting whether people will click ads, employee habits, etc.

Tools

Free to use anything to get the job done. We used:

  • Matlab
  • Scikit-learn
  • Git
  • Various other Python packages
  • Working with HDF5s
  • MongoDB
Of course, in a Kaggle competition you're free to use any tools you want. You could pick something you just want to learn, or whatever your favourite tool for the job is.

Here are some of the things we used. We used Matlab for the feature preprocessing because the files came in .mat format, which turned out to be more difficult to open than the scipy.io documentation suggests.
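
One common gotcha with .mat files is that newer MATLAB v7.3 files are HDF5 underneath and scipy.io.loadmat refuses them, so a fallback like the following is sometimes needed (the file name here is hypothetical, and this assumes the file contains plain datasets).

```python
# scipy.io.loadmat handles older .mat files; MATLAB v7.3 files are HDF5
# underneath and need h5py instead. 'segment.mat' is a hypothetical name.
import scipy.io
import h5py

try:
    data = scipy.io.loadmat('segment.mat')       # works for <= v7.2 files
except NotImplementedError:
    with h5py.File('segment.mat', 'r') as f:     # v7.3 files are HDF5
        data = {k: f[k][()] for k in f.keys()}
```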

Techniques

It's possible to quickly try things out to see if they'll work.

Preprocessing

A comprehensive list of features can be found in the [repository][repo]. Useful extractions were:

  • cln,csp,dwn_feat_pib_ratioBB_:
    • cln - Cleaned
    • csp - Common Spatial Patterns (transformation)
    • dwn - Downsampled
    • pib - Power in band
    • ratioBB - ratio of power to broadband power
  • cln,ica,dwn_feat_mvar-PDC_:
    • ica - Independent Component Analysis (transformation)
    • mvar - coefficients of fitted Multivariate-AutoRegressive model
    • PDC - Partial Directed Coherence for MVAR
  • And approximately 850 other options...
Here are two examples of pre-processing options Scott worked on. I've deciphered the naming convention to make it clearer what each one is.

The first of these is the one in the plots we've already seen. Despite being relatively simple, it was very effective, which is consistent with what other teams found. The other feature here also scored highly in our batch tests, but unfortunately we were never able to accurately reproduce those results on the leaderboard, which was frustrating.
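
As a rough illustration of what a power-in-band feature with a broadband ratio looks like for a single channel, here is a sketch using Welch's method; the band edges and sampling rate are illustrative, not necessarily the values Scott used.

```python
# Rough sketch of power-in-band / broadband-ratio features for one channel.
# Band edges and sampling rate are illustrative only.
import numpy as np
from scipy.signal import welch

def pib_ratio(x, fs=400, bands=((1, 4), (4, 8), (8, 12), (12, 30), (30, 70))):
    """Power in each band divided by broadband power, for one channel x."""
    freqs, psd = welch(x, fs=fs, nperseg=int(fs) * 2)
    broadband = np.trapz(psd, freqs)
    out = []
    for lo, hi in bands:
        idx = (freqs >= lo) & (freqs < hi)
        out.append(np.trapz(psd[idx], freqs[idx]) / broadband)
    return np.array(out)

# e.g. pib_ratio(np.random.randn(10 * 400)) gives five ratio features
```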

There are many, many more of these options which Scott managed to create; they're all sitting on the salmon server if you want to take a closer look, and their descriptions can be found on the wiki for the repository.


Machine learning

Here is an (incomplete) list of what we tried:

  • Random Forests
    • Random forest classifiers
    • Totally random tree embedding
    • Extra-tree feature selection
  • Support Vector Machines
    • Various different kernels
  • Logistic Regression
  • Adaboost
  • Platt scaling
  • Univariate feature selection
  • Restricted Boltzmann machine
  • Recursive feature elimination
  • ...
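
Most of these came straight out of scikit-learn. As a flavour of how a couple of the items above fit together (illustrative only, not our exact configuration), univariate feature selection feeding a random forest looks like this:

```python
# Illustrative only: univariate feature selection feeding a random forest,
# wrapped in a scikit-learn pipeline.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
    ('select', SelectKBest(f_classif, k=100)),
    ('forest', RandomForestClassifier(n_estimators=500)),
])
# clf.fit(X_train, y_train); clf.predict_proba(X_test)[:, 1] then gives the
# preictal probabilities that get written into a submission.
```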

Organising the project

  • Teamwork with git experience
  • TDD
  • Code documentation
It's easy to use a competition as an opportunity to get acquainted with a new technique you might want to use in another project. You can quickly understand how to make your chosen ML algorithm work well, because you see the results right away, and it's a real problem.

Scott mainly worked on the preprocessing in Matlab. In the repository there is a vast directory of Matlab scripts, which I avoid.

We were also able to put some time into learning development techniques which we wouldn't find a good excuse to look at otherwise. This was largely using unit tests in Python to get some bugs out of the code we were using to build training and test sets from the processed HDF5s.
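
As a toy flavour of the kind of check we mean, here is a self-contained example; the splitter below is a made-up stand-in defined inline so the test actually runs, not our actual HDF5-handling code.

```python
# Toy flavour of the kind of unit test we mean; split_segments is a made-up
# stand-in for the real train/test building code.
import unittest

def split_segments(names):
    """Stand-in splitter: anything with 'test' in the name goes to the test set."""
    train = [n for n in names if 'test' not in n]
    test = [n for n in names if 'test' in n]
    return train, test

class TestSplitSegments(unittest.TestCase):
    def test_no_segment_in_both_sets(self):
        names = ['Dog_1_interictal_1', 'Dog_1_preictal_1', 'Dog_1_test_1']
        train, test = split_segments(names)
        self.assertFalse(set(train) & set(test))

if __name__ == '__main__':
    unittest.main()
```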

Tips and tricks in seizure prediction

Our process

Our data flow chart.

We did our preprocessing in Matlab, and it seemed like the best way to get the data from Matlab into Python was to write an HDF5 file for each preprocessing operation and load them into Python as required.

Once Scott was done with it this resulted in around 300GB of HDF5 files from around 30GB of raw data.

Figuring out which of these were actually going to be useful was a massive problem, considering we only had around 4000 samples spread over 7 subjects.

We launched several batch scripts with the hope that we might be able to find a "silver bullet" feature somewhere in there.

This failed, and once the preprocessing had finished it proved extremely difficult to find anything better than the hand-picked set of features we chose at the start of classification.

Model averaging

So we were 10 hours from the deadline and I had completely run out of ideas. I'd spent all week trying to improve our score with various forms of feature selection and by including other preprocessed features, and had turned up nothing. Finlay and Scott had been doing the same, and we'd all come up with nothing.

In a last-ditch attempt to improve our score slightly, I just took our two best submission CSVs and averaged the predictions.

We immediately jumped up 4 places.

Including some other high performing submissions we were able to jump up several more places before the end of the competition.
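
Concretely, the averaging was no more complicated than something like this; the file names are hypothetical, and the clip/preictal column names are assumed to be the submission format.

```python
# Roughly what the averaging amounted to; file names are hypothetical and
# the clip/preictal columns are assumed to be the submission format.
import pandas as pd

a = pd.read_csv('best_submission.csv')          # columns: clip, preictal
b = pd.read_csv('second_best_submission.csv')
avg = a.copy()
avg['preictal'] = (a['preictal'] + b['preictal']) / 2
avg.to_csv('averaged_submission.csv', index=False)
```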

Genetic algorithms

Michael Hill's method came up on GitHub two days ago [@hill]:

...population size of 30 and runs for 10 generations. The population is initialised with random feature masks consisting of roughly 55% features activated and the other 45% masked away. The fitness function is simply a CV ROC AUC score.

His model:

. . .

The default selected classifier for submission is linear regression.

Michael Hill won the previous AES seizure challenge. This time he came 5th, and the method he used was this complicated genetic approach to feature selection.
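To make that recipe concrete, here is a stripped-down sketch of genetic-algorithm feature selection in the same spirit (population of 30 binary masks with roughly 55% of features active, 10 generations, cross-validated ROC AUC as fitness). It is a reconstruction under those stated parameters, not his actual code, and the logistic regression inside the fitness function is just a placeholder classifier.

```python
# Stripped-down reconstruction of GA feature selection, not Michael Hill's code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

def fitness(mask, X, y):
    """Cross-validated ROC AUC using only the features selected by the mask."""
    if not mask.any():
        return 0.0
    clf = LogisticRegression()
    return cross_val_score(clf, X[:, mask], y, cv=5, scoring='roc_auc').mean()

def evolve(X, y, pop_size=30, generations=10, p_active=0.55):
    n_features = X.shape[1]
    population = rng.rand(pop_size, n_features) < p_active   # random masks
    for _ in range(generations):
        scores = np.array([fitness(m, X, y) for m in population])
        parents = population[np.argsort(scores)[-(pop_size // 2):]]  # keep best half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.randint(len(parents), size=2)]
            child = np.where(rng.rand(n_features) < 0.5, a, b)          # uniform crossover
            child = np.logical_xor(child, rng.rand(n_features) < 0.01)  # mutation
            children.append(child)
        population = np.vstack([parents] + children)
    scores = np.array([fitness(m, X, y) for m in population])
    return population[scores.argmax()]                        # best feature mask
```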

Linear regression was the method of choice, i.e. scikit-learn linear regression with the predictions scaled and passed through a sigmoid to give predict_proba-style probabilities. Others with high-scoring results also did this, including Jonathan Tapson.
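
My reading of "scaled and sigmoided" is something like the following; this is an assumption about what's meant, not his published code.

```python
# Assumed interpretation: fit an ordinary linear regression on 0/1 labels,
# standardise the raw predictions, then squash them through a logistic function.
import numpy as np
from sklearn.linear_model import LinearRegression

def sigmoid_scores(X_train, y_train, X_test):
    raw = LinearRegression().fit(X_train, y_train).predict(X_test)
    scaled = (raw - raw.mean()) / raw.std()   # crude scaling step
    return 1.0 / (1.0 + np.exp(-scaled))      # probability-like scores in (0, 1)
```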

Repository

Our repository can be found at... TODO

Competitive Data Science

Conclusions

. . .

Advantages:

  • Get to try new things
  • Learn new skills
  • Working break from your PhD - you get immediate feedback
  • Might discover something useful

. . .

Disadvantages:

  • Can quickly absorb time
  • You have to have a good team
  • Models people create are not necessarily useful:
    • Netflix challenge
    • Engineered ensemble models are over-complicated

Next?

The next competitions coming up are:

  • BCI Challenge @ NER 2015 - $1,000
  • Helping Santa's Helpers - $20,000
  • Click-Through Rate Prediction - $15,000
If you want to start one now, here are your options.

The first of these is probably the most interesting one. The goal is to detect errors in a spelling task, given EEG recordings.

The second involves optimising an objective function by assigning different elves to different toys. Unfortunately, it's by a company called FICO and they want you to use their special software to do it...

For the third you're allowed to use whatever you want, but it's not very interesting. And if you're doing a PhD you probably never wanted to work for an advertising company.


References