Week 9: Anomaly Detection and Recommender Systems

Anomaly detection

Problem motivation

  • Imagine you are making aircraft engines and measuring features of these engines:
    • x(1) = heat generated
    • x(2) = vibration intensity
    • etc.
  • There is a dataset {x(1), x(2), ..., x(m)} of m examples
  • A new engine x(test) should be assessed for quality assurance, i.e., whether it is similar to the previously manufactured aircraft engines and therefore ready to ship
  • In essence, this is anomaly detection
  • Formally,
    • Given a dataset of non-anomalous examples
    • Build a model p(x) that estimates the probability of seeing an example x
    • If p(x(test)) < epsilon -> flag anomaly
    • If p(x(test)) >= epsilon -> OK
  • Examples of anomaly detection
    • Fraud detection:
      • x(i) = features of user i's activities
      • Model p(x) from data
      • Identify unusual users by checking which have p(x) < epsilon
    • Manufacturing (as above)
    • Monitor computers in a data center
      • x(i) = features of machine i
      • Model p(x) from data
      • Identify unusual machine activity that may indicate impending machine failure

Gaussian (normal) distribution

  • Assume x is a real number; if x follows a Gaussian distribution with mean μ and variance σ^2, then
    • x ~ N(μ, σ^2)
    • p(x; μ, σ^2) = 1/(sqrt(2 * pi) * σ) * exp(- (x-μ)^2 / (2σ^2))
      • p(x; μ, σ^2) means "the probability of x, parameterized by μ and σ^2"

Parameter estimation

  • Dataset {x(1), ..., x(m)}
  • Suspect that these examples were pulled from x(i) ~ N(μ, σ^2)
  • μ = (1/m) * sum(x(i)) # the sample mean
  • σ^2 = (1/m) * sum( (x(i) - μ)^2 ) # the sample variance
  • The above μ and σ^2 are the maximum likelihood estimates
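
A minimal NumPy sketch of these estimates and of the Gaussian density defined above (the function names and the 1-D sample x are illustrative, not part of the course material):

```python
import numpy as np

def estimate_gaussian(x):
    """Maximum likelihood estimates of the mean and variance of a 1-D sample."""
    mu = np.mean(x)
    sigma2 = np.mean((x - mu) ** 2)   # divides by m, not m - 1
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """p(x; mu, sigma^2) for the normal distribution."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
```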

Develop an algorithm for anomaly detection

  • Density estimation
    • Training set: {x(1), ..., x(m)}
    • Each example x(i) is an n-dimensional vector (i.e., has n features)
    • Model p(x) = p(x1; μ1, σ1^2) * p(x2; μ2, σ2^2) * ... * p(xn; μn, σn^2)
      • Assume that each feature has a Gaussian distribution, e.g., x1 ~ N(μ1, σ1^2)
      • Equivalently, p(x) = product over j = 1 to n of p(xj; μj, σj^2)
      • Note that the subscripts j index the n features, not the m examples x(1), ..., x(m)
    • The above model assumes the features are independent (though the algorithm still works well even when they are not)

Anomaly detection algorithm:
  1. Choose features xj that you think might be indicative of anomalous examples
  2. Fit parameters μ1, ..., μn, σ1^2, ..., σn^2
  3. Given new example x, compute p(x)
  4. Label anomaly if p(x) < epsilon
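
A short NumPy sketch of steps 2-4 under this per-feature Gaussian model (the matrix shapes, variable names, and the threshold epsilon are illustrative assumptions):

```python
import numpy as np

def fit_params(X):
    """Fit mu_j and sigma_j^2 for each of the n features; X has shape (m, n)."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)            # maximum likelihood estimate (divides by m)
    return mu, sigma2

def p(X, mu, sigma2):
    """p(x) = product over the n features of the univariate Gaussian density."""
    densities = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

# Usage sketch: flag anomalies with some threshold epsilon
# mu, sigma2 = fit_params(X_train)
# is_anomaly = p(X_new, mu, sigma2) < epsilon
```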

How to evaluate an anomaly detection algorithm

  • As discussed before, it is important to have a single real-number evaluation of the algorithm
  • Assume the data are labeled as non-anomalous (y=0) and anomalous (y=1)
  • Training set remains unlabeled x(1), ..., x(m)
  • Cross-validation set is labeled (x(1), y(1)), ..., (x(m_cv), y(m_cv))
  • Test set is (x(1), y(1)), ..., (x(m_test), y(m_test))
  • Aircraft motivating example
    • 10,000 examples of normal engines
    • 20 engines are anomalous
    • How to split the data:
      • Training set: 6000 non-anomalous engines, estimate μ and σ
      • Cross-validation set: 2000 non-anomalous engines (y=0), 10 anomalous engines (y=1)
      • Test set: 2000 non-anomalous engines (y=0), 10 anomalous engines (y=1)
    • Algorithm evaluation:
      • Fit the model p(x) on the training set
      • On a cross-validation/test example x, predict:
        • y = 1 if p(x) < epsilon
        • y = 0 if p(x) >= epsilon
      • Possible evaluation metrics:
        • True positive, false positive, false negative, true negative
        • Precision/recall
        • F1-score
      • Can also use the cross-validation set to choose the parameter epsilon based on the above evaluation metrics
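
A sketch of choosing epsilon by maximizing F1 on the cross-validation set (p_cv holds p(x) for the cross-validation examples and y_cv the 0/1 labels; both names are assumptions):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds and keep the one with the best F1 score."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)          # predict y = 1 (anomaly) if p(x) < eps
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```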

Anomaly detection vs. supervised learning

| Anomaly detection | Supervised learning |
| --- | --- |
| Very small number of positive examples (e.g., 1-20) | Large number of positive and negative examples |
| Large number of negative examples | |
| Many different "types" of anomalies; hard for any algorithm to learn from the positive examples what anomalies look like, and future anomalies may look nothing like any of the anomalous examples seen so far | Enough positive examples for the algorithm to get a sense of what positive examples look like; future examples are likely to be similar to the ones in the training set |

Applications of anomaly detection vs. supervised learning

| Anomaly detection | Supervised learning |
| --- | --- |
| fraud detection | email spam classification |
| manufacturing (e.g., aircraft engines) | weather prediction |
| monitoring machines in a data center | cancer classification |

If there are a lot of anomalies, any of the above examples could shift to the supervised learning column.

Optimizing available features for anomaly detection

  • This has a lot of impact on how well the anomaly detection works
  • Plot the data to ensure that it looks vaguely Gaussian
    • Though the algorithm works OK even if the data are not Gaussian, data transformations can make the distribution look more Gaussian (e.g., log-transform, square root, cube root); see the sketch below
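
For instance, a few candidate transformations of a skewed, non-negative feature (the exponentially distributed x here is just a stand-in for a real feature):

```python
import numpy as np

x = np.random.exponential(size=1000)   # stand-in for a skewed, non-negative feature

x_log  = np.log(x + 1)                 # log-transform (+1 avoids log(0))
x_sqrt = x ** 0.5                      # square root
x_cbrt = x ** (1.0 / 3.0)              # cube root
# Plot a histogram of each version and keep whichever looks most Gaussian.
```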

Error analysis for anomaly detection

  • Want p(x) large for normal examples and p(x) small for anomalous examples
  • Most common problem:
    • p(x) is comparable (say, both large) for normal and anomalous examples
  • Look at the wrongly classified anomaly and try to find a new feature on which it is clearly different from the non-anomalous examples

Choosing features for anomaly detection

  • Choose features that might take on unusually large or small values in the event of an anomaly
  • As an example, monitoring a computer cluster:
    • x1 = memory use of computer
    • x2 = number of disk accesses/second
    • x3 = CPU load
    • x4 = network traffic
  • Suppose CPU load and network traffic grow linearly with each other
  • Perhaps in one failure case a machine gets stuck in an infinite loop
    • CPU load high, but network traffic low
    • Maybe the new feature would be: CPU load / network traffic
      • This will be large for the anomaly and small for non-anomalous machines (see the sketch below)
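
A toy sketch of the derived feature (the measurement values are made up; only the ratio idea comes from the notes):

```python
import numpy as np

# Hypothetical per-machine measurements (stand-in values)
cpu_load = np.array([0.90, 0.50, 0.95])
network_traffic = np.array([0.80, 0.40, 0.01])

# New feature: large when the CPU is busy but the machine is serving little traffic
x5 = cpu_load / network_traffic
print(x5)   # the third machine (possible infinite loop) stands out
```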

Optional material covers modeling anomaly detection with a multivariate Gaussian distribution. This is accomplished by modeling the mean of all of the features as a vector and the variance as a covariance matrix. This allows the model to capture correlations within the data that would otherwise be lost if anomaly detection were performed using the method described above. In other words, the above model can only fit Gaussian distributions that are axis-aligned, i.e., it is unable to model correlations between features; that special case corresponds to a covariance matrix with zeros everywhere except its diagonal.

| Original model | Multivariate Gaussian |
| --- | --- |
| Manually create features to capture anomalies where x1, x2 take unusual combinations of values | Automatically captures correlations between features |
| Computationally cheaper (scales better to large n, e.g., n > 10,000) | Computationally more expensive |
| OK even if m (training set size) is small | Must have m > n or else the covariance matrix is non-invertible; m >= 10n is preferred |

The covariance matrix may also be non-invertible if features are redundant (e.g., duplicated or linearly dependent).
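
A minimal sketch of the multivariate version, fitting a mean vector and full covariance matrix and evaluating the density with SciPy (the matrix shapes and function names are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_multivariate(X):
    """Estimate the mean vector and covariance matrix from the training set (shape (m, n))."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)   # maximum likelihood (divides by m)
    return mu, Sigma

def p_multivariate(X, mu, Sigma):
    """Multivariate Gaussian density; unlike the per-feature model, it captures feature correlations."""
    return multivariate_normal(mean=mu, cov=Sigma).pdf(X)

# Usage sketch: is_anomaly = p_multivariate(X_new, mu, Sigma) < epsilon
```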

Recommender systems

Predicting movie ratings:

  • Allow users to rate movies using zero to five stars

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) |
| --- | --- | --- | --- | --- |
| Love at last | 5 | 5 | 0 | 0 |
| Romance forever | 5 | | | 0 |
| Cute puppies of love | | 4 | 0 | |
| Nonstop car chases | 0 | 0 | 5 | 4 |
| Swords v. karate | 0 | 0 | 5 | |

Empty cells are movies that have not been rated.

Terminology:

  • nu = no. users

  • nm = no. movies

  • r(i,j) = 1 if user j has rated movie i

  • y(i,j) = rating given by user j to movie i

    • defined only if r(i,j) = 1
  • Above, nu = 4, nm = 5

  • The recommender system problem is, given the empty cells, predict the likely rating for each user and movie

Content-based recommender systems

  • Suppose for each movie, there are associated features, e.g.,

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | x1 (romance) | x2 (action) |
| --- | --- | --- | --- | --- | --- | --- |
| Love at last | 5 | 5 | 0 | 0 | 0.9 | 0 |
| Romance forever | 5 | | | 0 | 1 | 0.01 |
| Cute puppies of love | | 4 | 0 | | 0.99 | 0 |
| Nonstop car chases | 0 | 0 | 5 | 4 | 0.1 | 1 |
| Swords v. karate | 0 | 0 | 5 | | 0 | 0.9 |

Therefore, each movie can be represented by a feature vector, e.g., x(1) = [1 0.9 0] (the leading 1 is the intercept term x0).

  • x(1) ... x(5) would be the feature vectors for each movie
  • For each user j, learn a parameter theta_j, and predict user j as rating movie i with theta_j_transpose * x(i)
  • Concretely, Alice (user 1) would have a theta_1 parameter vector
    • Predicting her rating of Cute puppies of love would be theta_1_transpose * x(3)
      • theta_1 is associated with Alice
      • x(3) is associated with Cute puppies of love
  • Formally,
    • r(i,j) = 1 if user j has rated movie i
    • y(i,j) = rating given by user j to movie i
    • theta_j = parameter vector for user j
    • x(i) = feature vector for movie i
    • m(j) = no. of movies rated by user j
  • To learn theta_j, this is a linear regression problem
    • minimize the squared error term summed over all movies that user j has rated
  • To learn theta_1 through theta_nu, sum over all users as well as all movies (see the sketch below)
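
A minimal NumPy sketch of the per-user cost described above; the matrices follow the r(i,j) and y(i,j) definitions, while the variable names and the optional regularization weight lam are assumptions:

```python
import numpy as np

def user_cost(theta_j, X, Y, R, j, lam=0.0):
    """Squared-error cost for user j over the movies that user has rated.

    X: (nm, n) matrix whose rows are the movie feature vectors x(i)
    Y: (nm, nu) matrix of ratings y(i, j)
    R: (nm, nu) matrix with r(i, j) = 1 where user j rated movie i
    """
    rated = R[:, j] == 1
    errors = X[rated] @ theta_j - Y[rated, j]
    return 0.5 * np.sum(errors ** 2) + (lam / 2) * np.sum(theta_j ** 2)
```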

There are cases where these features of each movie are not available and content-based systems will not work.

Collaborative filtering

  • Assume the features are blank:

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | x1 (romance) | x2 (action) |
| --- | --- | --- | --- | --- | --- | --- |
| Love at last | 5 | 5 | 0 | 0 | ? | ? |
| Romance forever | 5 | | | 0 | ? | ? |
| Cute puppies of love | | 4 | 0 | | ? | ? |
| Nonstop car chases | 0 | 0 | 5 | 4 | ? | ? |
| Swords v. karate | 0 | 0 | 5 | | ? | ? |

  • Given parameter vectors from the users, the features x1 and x2 can be inferred.
  • It turns out that, by alternating between these two steps (learning thetas from features and learning features from thetas), the following system can be devised:
    • Guess theta, use it to estimate x, use x to improve the thetas, use the thetas to improve x, ... until convergence
  • This works, of course, because some subset of users has rated some overlapping subset of movies
  • These data over time can be used to improve the thetas and features learned by the collaborative filtering algorithm
  1. Initialize x(1), ..., x(nm), theta(1), ..., theta(nu) to small random values
  2. Minimize the cost function J(x(1), ..., x(nm), theta(1), ..., theta(nu)) using gradient descent (or an advanced optimization algorithm)
  3. For a user with parameters theta and a movie with (learned) features x, predict a star rating of theta_transpose * x
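
A sketch of step 2's cost function and gradients, minimizing over the movie features X and user parameters Theta simultaneously (the matrix shapes and the regularization weight lam are assumptions):

```python
import numpy as np

def cofi_cost_and_grads(X, Theta, Y, R, lam=0.0):
    """Collaborative filtering cost and gradients.

    X:     (nm, n)  movie feature vectors
    Theta: (nu, n)  user parameter vectors
    Y:     (nm, nu) ratings;  R: (nm, nu) indicator of which ratings exist
    """
    errors = (X @ Theta.T - Y) * R    # zero out unrated entries
    J = 0.5 * np.sum(errors ** 2) + (lam / 2) * (np.sum(X ** 2) + np.sum(Theta ** 2))
    X_grad = errors @ Theta + lam * X
    Theta_grad = errors.T @ X + lam * Theta
    return J, X_grad, Theta_grad
```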

Collaborative filtering is equivalent to low-rank matrix factorization. Mean normalization helps ensure the collaborative filtering algorithm still functions for users who have not rated any products/movies.

  • For users who haven't rated anything, this effectively predicts each movie's mean rating
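
A sketch of mean normalization, computing each movie's mean over its rated entries only (shapes as in the earlier sketches); predictions add the per-movie mean back, so a user with no ratings simply gets each movie's mean rating:

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each movie's mean rating, computed over rated entries only."""
    mu = (Y * R).sum(axis=1) / np.maximum(R.sum(axis=1), 1)   # per-movie mean
    Y_norm = (Y - mu[:, None]) * R                            # keep unrated cells at 0
    return Y_norm, mu

# Prediction for user j on movie i: theta_j^T x(i) + mu[i]
# A user with no ratings ends up with theta_j close to 0, so the prediction is just mu[i].
```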