Source: Feature Engineering - Kaggle
Also see my previous notes:
MI (mutual information) is a measure of the mutual dependence between two random variables: the amount of information gained (i.e., the reduction in uncertainty) about one variable from observing the value of the other. MI is similar to the Pearson Correlation Coefficient (PCC), but much more general: PCC can only capture linear relationships between two random variables, while MI captures relationships of any kind.
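For reference, the standard definition for discrete variables (this is the textbook information-theory formula, not something from the course; the continuous case replaces the sums with integrals):

$$
I(X;Y) \;=\; \sum_{x}\sum_{y} p(x, y)\,\log \frac{p(x, y)}{p(x)\,p(y)}
$$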
- MI is non-negative; MI = 0 means the two variables are independent (zero information gain).
- MI is rarely higher than 2 in practice, even though there's no upper bound; MI is a logarithmic quantity, so it increases very slowly.
- A low value for MI(feature0, target) doesn't necessarily mean feature0 is unimportant. Some features are more useful when paired with another feature: for example, neither the length nor the breadth of a property may be very predictive of its price by itself, but together (bivariate, e.g. their product) they can be a much stronger predictor (see the sketch after this list).
- A high MI between a feature and the target doesn't necessarily mean the model will be able to make use of it. For instance, if the model cannot capture non-linear relationships and the feature has a non-linear relationship with the target, the optimized model will still be a poor fit. In such cases, you either need to transform the features (e.g. square a feature, take the product of two features) to create relationships the model can handle, or choose a more powerful model.
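A minimal sketch of the first point, using scikit-learn's `mutual_info_regression` on synthetic data (the length/breadth/price setup and all names here are illustrative assumptions, not from the course):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 5_000
length = rng.uniform(5, 50, n)
breadth = rng.uniform(5, 50, n)
price = length * breadth + rng.normal(0, 50, n)  # price driven by the area

# Compare univariate MI scores against the score of the combined feature.
X = np.column_stack([length, breadth, length * breadth])
mi = mutual_info_regression(X, price, random_state=0)
for name, score in zip(["length", "breadth", "length * breadth"], mi):
    print(f"MI({name}, price) = {score:.3f}")
# Expected pattern: length and breadth each score low on their own,
# while their product scores much higher.
```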
Creating new features can help improve performance, but which features help depends on the model, so understanding the model's strengths and weaknesses comes first:
- Linear models can learn sums and differences naturally.
- Ratios are difficult to learn even for non-linear models such as boosted trees or neural networks, so providing a ratio explicitly as a feature often yields easy performance gains.
- Normalizing features usually helps linear models and neural nets, but not much with tree models.
- Counting or summing across features (aggregating many features into one) is especially helpful for tree models, since trees have no natural way of aggregating information across many features at once.
- When there is a categorical feature, it can help to compute a grouped aggregate of some other feature over each category. For example, when predicting an individual's income and we know the individual's state, giving the model the "average state income" (computed on the training data) as an input can help. This is very similar to deriving normalization factors from the training data. Note that such features are very sensitive to data drift (e.g., inflation); see the sketch after this list.
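A minimal sketch of the grouped-aggregate idea in pandas (the state/income columns are hypothetical); the key point is that the per-group statistic is computed on the training data only and then joined into both splits:

```python
import pandas as pd

# Toy train/test frames with hypothetical columns.
train = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "TX"],
    "income": [90_000, 110_000, 80_000, 100_000, 70_000],
})
test = pd.DataFrame({"state": ["CA", "TX", "NY"]})

# Per-state mean income, computed from the training data only.
avg_state_income = (
    train.groupby("state")["income"].mean().rename("avg_state_income")
)

# Join the same training-derived mapping into both splits.
train = train.join(avg_state_income, on="state")
test = test.join(avg_state_income, on="state")
print(test)
```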
TODO
TODO
TODO
Sources:
- https://arxiv.org/abs/2404.16130
- https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
TODO