-
Notifications
You must be signed in to change notification settings - Fork 1
/
Feature engineering.txt
54 lines (37 loc) · 2.58 KB
/
Feature engineering.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
Feature engineering is the process of creating new features or transforming existing features in a dataset to improve the performance of a machine learning model. There are many techniques that can be used for feature engineering, including:
Scaling and Normalization: Scaling and normalizing features can help to bring them onto the same scale, which can be beneficial for algorithms that rely on distance measures. This can be done using techniques such as min-max scaling, standardization, and log transformation.
Encoding Categorical Features: Categorical features are often encoded as integers or one-hot encoded for use in machine learning algorithms.
Feature Selection: Feature selection is the process of selecting a subset of the most relevant features in a dataset. This can be done using techniques such as wrapper methods, filter methods, and embedded methods.
Feature Extraction: Feature extraction is the process of creating new features from existing features using techniques such as principal component analysis (PCA) and independent component analysis (ICA).
Polynomial Features: Polynomial features are created by adding powers of the original features to the feature set. This can be useful for fitting non-linear relationships.
Interaction Features: Interaction features are created by taking the product of two or more existing features. These can be useful for capturing relationships between features that may not be apparent from the individual features alone.
Here is an example of how to perform some of these techniques in Python using the pandas and sklearn libraries:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, RFE
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
# Load data
df = pd.read_csv('data.csv')
# Scale and normalize features
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
# Log transform features
df_log = np.log(df + 1)
# Encoding categorical features
encoder = OneHotEncoder()
df_encoded = encoder.fit_transform(df[['cat_col_1', 'cat_col_2']])
# Feature selection
selector = SelectKBest(k=10)
df_selected = selector.fit_transform(df, y)
# Recursive feature elimination
rfe = RFE(estimator, 10)
df_rfe = rfe.fit_transform(df, y)
# Feature extraction
pca = PCA(n_components=10)
df_pca = pca.fit_transform(df)
# Polynomial features
poly = PolynomialFeatures(2)
df_poly = poly.fit_transform(df)
# Interaction features
df_interact = df[['col_1', 'col_2']] * df[['col_3', 'col_4']]