
home | copyright ©2019, tjmenzie@ncsu.edu

syllabus | src | submit | chat

Homework 9: Big Data

Let's see if we can learn from large data sets, without holding all that data.

Same as homework 8 but now:

  • Read in the first 5000 rows and randomize their order
  • Do unsupervised clustering on the first 500 rows
  • Then, one at a time, dribble in the remaining 4500 rows and only update the kids if the new data is anomalous (see the sketch after this list)
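Here is a minimal sketch of that workflow. The names cluster and Node.add (and the rows argument) are assumptions standing in for your homework 8 code; they are not defined by this assignment.

```python
# Sketch only: cluster() and the node's add() method are hypothetical
# placeholders for the clustering code you wrote in homework 8.
import random

def incremental(rows, n_seed=500, seed=1):
    random.seed(seed)
    rows = rows[:5000]             # keep just the first 5000 rows
    random.shuffle(rows)           # randomize the order
    root = cluster(rows[:n_seed])  # unsupervised clustering of the first 500 rows
    for row in rows[n_seed:]:      # dribble in the remaining 4500, one at a time
        root.add(row)              # a node updates its kids only if the row is anomalous
    return root
```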

Engineering tips:

  • Make the nodes of your tree "smart".
    • They get new rows, one at a time
      • and only anomalous rows get pushed down into the sub-trees.
    • When a node has collected enough rows, it knows to make its own sub-tree.
  • Define "anomaly" using the pivots
    • havea magic constant α=0.5
    • If the cosine distance from east to west is c;
    • The if a new row is distance a,b from east west then if falls at distance x along c
      • x = (a^2 + c^2 - b^2) / (2c)
    • And if the sub-trees are being split at s
      • if s < 0.5
        • then far = s*α and anomalous is x < far
      • else
        • then far = s+ (1-s)*α and anomalous is x > far
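As a sketch of those tips, the following assumes each node remembers its east and west pivots, the east-west distance c, its split point s, and a dist() row-distance function from earlier homeworks. All of those names are placeholders, not part of this assignment.

```python
# Sketch only: node.east, node.west, node.c, node.s, and dist() are hypothetical
# names for things your earlier homework code already computes.
ALPHA = 0.5                                 # the magic constant α

def projection(a, b, c):
    "Cosine rule: distance x along the line from east to west."
    return (a**2 + c**2 - b**2) / (2 * c)

def anomalous(node, row, alpha=ALPHA):
    a = dist(row, node.east)                # distance to the east pivot
    b = dist(row, node.west)                # distance to the west pivot
    x = projection(a, b, node.c)
    if node.s < 0.5:
        far = node.s * alpha
        return x < far                      # anomalies fall below the split
    else:
        far = node.s + (1 - node.s) * alpha
        return x > far                      # anomalies fall above the split
```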

To assess the results:

For two large data sets (xomo10000 and pom310000):

  • Build a tree using all data (as in prior homeworks).
  • 100 times, select a row at random from a leaf cluster; call these rows the probes.
    • Tag each of these probes with the BEFORE values:
      • size, mean and standard deviation of the performance scores in their leaf cluster.
  • 20 times, rebuild the trees using all the data
    • Find the probes
    • Tag the probes with the AFTER values:
      • size, mean and standard deviation of the performance scores in their leaf cluster
    • Using the code at https://gist.github.com/timm/33578871be53e604da83679dc7ccbcc5, report how often these probes land on the same distributions in AFTER as in BEFORE - i.e. how often the Num.same test passes.
  • Let baseline be the mean same score found in the above 20 repeats.
  • Now, 20 times, repeat:
    • build the trees incrementally;
    • compute the same score (using the same 100 probes as used above).
  • Write a table showing the same scores seen with the all-data trees and with the incremental trees (a sketch of this assessment loop follows this list).
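Here is a minimal sketch of that assessment loop. The helpers build_tree, leaf_of, stats, and same are all hypothetical stand-ins: build_tree for your homework 8 all-data clusterer, incremental for the sketch above, stats for the (size, mean, sd) of a leaf's performance scores, and same for the Num.same test in the gist.

```python
# Sketch only: build_tree, leaf_of, stats, and same are hypothetical stand-ins
# for your own code and for the Num.same test in the gist linked above.
import statistics

def same_score(tree, probes, before):
    "Fraction of probes whose AFTER leaf distribution matches their BEFORE one."
    hits = 0
    for probe, b in zip(probes, before):
        after = stats(leaf_of(tree, probe))   # (size, mean, sd) after rebuilding
        hits += same(b, after)                # 1 if the distributions look the same
    return hits / len(probes)

def assess(rows, probes, before, repeats=20):
    "before = the (size, mean, sd) tags recorded when the probes were first selected."
    all_scores  = [same_score(build_tree(rows),   probes, before) for _ in range(repeats)]
    incr_scores = [same_score(incremental(rows),  probes, before) for _ in range(repeats)]
    return statistics.mean(all_scores), statistics.mean(incr_scores)
```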

Write a file report.txt commenting on how much α affects the same score.

What could go wrong

The baseline score is very low (in which case the random projections are finding wildly different clusters).

  • If that happens, spend more time finding the pivots (i.e. lessen the "random" in the random projections).