Time since a particular value was last seen. Remove the .shift()
if you're not doing target encoding.
>>> import pandas as pd
>>> df = pd.DataFrame([
...     (1, 'cloudy'),
...     (2, 'cloudy'),
...     (3, 'sunny'),
...     (4, 'sunny'),
...     (5, 'cloudy'),
...     (6, 'sunny')
... ], columns=['time', 'location'])
>>> (df['time'] - df['time'].groupby(df['location'].shift().eq('cloudy').cumsum()).transform('first'))
0    0
1    0
2    0
3    1
4    2
5    0
Name: time, dtype: int64
(not too sure about the exact vocabulary)
- Blending is averaging predictions
- Bagging is averaging the predictions of models trained on bootstrap samples, i.e. random samples drawn with replacement
- Pasting is the same as bagging, but the samples are drawn without replacement
- Bumping is training a model on several bootstrap samples and keeping the one that performs best on the original dataset
- Stacking is training a meta-model on the predictions made by other models (see the sketch after this list)
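A minimal sketch with scikit-learn, assuming a generic classification task: BaggingClassifier's bootstrap flag switches between bagging and pasting, and StackingClassifier implements stacking.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Bagging: each tree sees a bootstrap sample (drawn with replacement)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, bootstrap=True)

# Pasting: same idea, but the samples are drawn without replacement
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, bootstrap=False, max_samples=0.8)

# Stacking: a logistic regression is trained on the base models' predictions
stacking = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier()), ('lr', LogisticRegression())],
    final_estimator=LogisticRegression()
)

for model in (bagging, pasting, stacking):
    model.fit(X, y)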
- Replace missing values with the mean, median, or most frequent value
- Random Forest imputation, i.e. iteratively predicting missing values from the other features (see the sketch below)
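A minimal sketch with scikit-learn, assuming numeric features; IterativeImputer is experimental and must be enabled explicitly, and plugging a random forest in as its estimator is one way to get Random Forest imputation.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: required before importing IterativeImputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan], [7.0, 8.0]])

# Simple strategies: 'mean', 'median', or 'most_frequent'
simple = SimpleImputer(strategy='median').fit_transform(X)

# Random Forest imputation: each feature's missing values are
# predicted from the other features, iteratively
forest = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50))
imputed = forest.fit_transform(X)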
Days of the week, hours, and minutes are cyclic ordinal features; cosine and sine transforms should be used to express the cycle. See this StackExchange discussion.
from math import cos, sin, pi

hours = list(range(24))
# The full cycle spans 2π, so hour 23 ends up close to hour 0
hours_cos = [cos(2 * pi * h / 24) for h in hours]
hours_sin = [sin(2 * pi * h / 24) for h in hours]
- One-hot encoding
- Target encoding (see the sketch after this list)
- Feature embedding
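A minimal target (mean) encoding sketch in pandas, assuming a binary label column; the smoothing strength m is a hypothetical choice, and in practice the encoding should be computed out-of-fold (or with the .shift() trick above) to avoid leakage.

import pandas as pd

df = pd.DataFrame({
    'location': ['cloudy', 'cloudy', 'sunny', 'sunny', 'cloudy', 'sunny'],
    'label': [1, 0, 1, 1, 0, 1]
})

# Target encoding: replace each category with the mean of the target,
# smoothed towards the global mean to tame rare categories
prior = df['label'].mean()
stats = df.groupby('location')['label'].agg(['mean', 'count'])
m = 10  # smoothing strength, a hypothetical choice
smoothed = (stats['count'] * stats['mean'] + m * prior) / (stats['count'] + m)
df['location_te'] = df['location'].map(smoothed)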
Use the adstock transformation to take lag effects into account when measuring marketing campaign impact.
advertising = [6, 27, 0, 0, 20, 0, 20]  # Marketing campaign intensities

# Each period carries over half of the previous period's adstocked intensity
for i in range(1, len(advertising)):
    advertising[i] += advertising[i - 1] * 0.5

print(advertising)
# [6, 30.0, 15.0, 7.5, 23.75, 11.875, 25.9375]
- Read this
- Try under-sampling if there is a lot of data
- Try over-sampling if there is not a lot of data
- Always under/over-sample on the training set only (see the sketch after this list). Don't resample the entire dataset before doing a train/test split; if you do, duplicates will exist between the two sets and the scores will be skewed
- Instead of predicting a class, predict a probability and pick a decision threshold manually to trade off precision and recall as you wish
- Use weights/costs
- Limit the over-represented class
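A minimal sketch of the right ordering with scikit-learn, assuming a binary target: split first, then over-sample the minority class on the training set only.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X = np.random.rand(1000, 5)
y = np.random.binomial(1, 0.1, size=1000)  # ~10% positive class

# Split first, so no resampled duplicate can leak into the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y)

# Over-sample the minority class on the training set only
minority = y_train == 1
X_min, y_min = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=(~minority).sum(), random_state=42
)
X_train = np.vstack([X_train[~minority], X_min])
y_train = np.concatenate([y_train[~minority], y_min])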
- Use time series cross-validation (explanatory diagram here, and a sketch after this list)
- Adversarial validation can help make relevant cross-validation splits
- Pseudo-labeling: augment the training set with test samples labeled by the model's own confident predictions
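A minimal sketch with scikit-learn's TimeSeriesSplit: each fold trains on the past and tests on the immediate future.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)

# Each split trains on earlier observations and tests on later ones
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print('train:', train_idx, 'test:', test_idx)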
- Using log on the target, training, and then applying exp to the predictions is naive: by Jensen's inequality the back-transformed predictions underestimate the mean (a sketch follows)
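A minimal numpy sketch of the bias, assuming a lognormal target; the exp(sigma^2 / 2) correction shown at the end is specific to the lognormal case.

import numpy as np

rng = np.random.default_rng(42)
y = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

# Modelling E[log y] and exponentiating recovers the median, not the mean
naive = np.exp(np.log(y).mean())
print(naive)      # ~1.0  (= exp(mu), the median)
print(y.mean())   # ~1.65 (= exp(mu + sigma^2 / 2), the mean)

# For a lognormal target the bias can be corrected explicitly
corrected = np.exp(np.log(y).mean() + np.log(y).var() / 2)
print(corrected)  # ~1.65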