Source: Feature Engineering - Kaggle
Also see my previous notes:
MI (mutual information) is a measure of the mutual dependence between two random variables: the amount of information gained (i.e., the reduction in uncertainty) about one variable from observing the value of the other. MI is similar to the Pearson Correlation Coefficient (PCC), but much more general: PCC can only capture linear relationships between two random variables, while MI captures relationships of any kind.
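For reference, the standard definition for discrete variables (this is the textbook information-theory formula, not something from the course; the continuous case replaces the sums with integrals):

$$
I(X;Y) \;=\; \sum_{x}\sum_{y} p(x, y)\,\log \frac{p(x, y)}{p(x)\,p(y)}
$$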
- MI is non-negative; MI = 0 means the two variables are independent (zero information gain).
- MI is rarely higher than 2 in practice, even though there's no upper bound; MI is a logarithmic quantity, so it increases very slowly.
- A low value for MI(feature0, target) doesn't necessarily mean feature0 is unimportant. Some features are more useful when paired with another feature: for example, neither the length nor the breadth of a property may be very predictive of its price by itself, but together (bivariate, e.g. their product) they can be a much stronger predictor (see the sketch after this list).
- A high MI between a feature and the target doesn't necessarily mean the model will be able to make use of it. For instance, if the model cannot capture non-linear relationships and the feature has a non-linear relationship with the target, the optimized model will still be a poor fit. In such cases, you either need to transform the features (e.g. square a feature, take the product of two features) to create relationships the model can handle, or choose a more powerful model.
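A minimal sketch of the first point, using scikit-learn's `mutual_info_regression` on synthetic data (the length/breadth/price setup and all names here are illustrative assumptions, not from the course):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 5_000
length = rng.uniform(5, 50, n)
breadth = rng.uniform(5, 50, n)
price = length * breadth + rng.normal(0, 50, n)  # price driven by the area

# Compare univariate MI scores against the score of the combined feature.
X = np.column_stack([length, breadth, length * breadth])
mi = mutual_info_regression(X, price, random_state=0)
for name, score in zip(["length", "breadth", "length * breadth"], mi):
    print(f"MI({name}, price) = {score:.3f}")
# Expected pattern: length and breadth each score low on their own,
# while their product scores much higher.
```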
Creating new features can help improve performance, but which features help depends on the model, so understanding the model's strengths and weaknesses comes first:
- Linear models can learn sums and differences naturally.
- Ratios are difficult to learn even for non-linear models such as boosted trees or neural networks, so providing a ratio explicitly as a feature often yields easy performance gains.
- Normalizing features usually helps linear models and neural nets, but not much with tree models.
- Counting or summing across features (aggregating many features into one) is especially helpful for tree models, since trees have no natural way of aggregating information across many features at once.
- When there is a categorical feature, it can help to compute a grouped aggregate of some other feature over each category. For example, when predicting an individual's income and we know the individual's state, giving the model the "average state income" (computed on the training data) as an input can help. This is very similar to deriving normalization factors from the training data. Note that such features are very sensitive to data drift (e.g., inflation); see the sketch after this list.
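A minimal sketch of the grouped-aggregate idea in pandas (the state/income columns are hypothetical); the key point is that the per-group statistic is computed on the training data only and then joined into both splits:

```python
import pandas as pd

# Toy train/test frames with hypothetical columns.
train = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "TX"],
    "income": [90_000, 110_000, 80_000, 100_000, 70_000],
})
test = pd.DataFrame({"state": ["CA", "TX", "NY"]})

# Per-state mean income, computed from the training data only.
avg_state_income = (
    train.groupby("state")["income"].mean().rename("avg_state_income")
)

# Join the same training-derived mapping into both splits.
train = train.join(avg_state_income, on="state")
test = test.join(avg_state_income, on="state")
print(test)
```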
TODO
TODO
TODO
Sources:
- https://arxiv.org/abs/2404.16130
- https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
TODO