Updates for 2025 #105

mrdbourke · 2024-09-04T05:55:02Z

mrdbourke · 2024-09-06T04:00:31Z

Scikit-Learn Notebook

Link - https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn.ipynb

Section 1.2.1 - Filling missing values

Pandas inplace=True behaviour will change in 3.0.

Calling a method with inplace=True on a selected column (chained assignment) will produce a warning:

FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

To fix, use df[col] = df[col].method(value) syntax instead.

Example:

# Fill the missing values in the Make column
# Note: Previous versions of pandas allowed inplace=True here. This will change in a future version, so use reassignment instead.

# Old (will produce a warning)
# car_sales_missing["Make"].fillna(value="missing", inplace=True)

# New
car_sales_missing["Make"] = car_sales_missing["Make"].fillna(value="missing")

Another example:

# Note: Previous versions of pandas allowed inplace=True here. This will change in a future version, so use reassignment instead.

# Old (will produce a warning)
# car_sales_missing["Colour"].fillna(value="missing", inplace=True)

# New
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna(value="missing")
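
A minimal, self-contained sketch of the same fix on a toy DataFrame (the data and the `toy_car_sales` name are made up for illustration, they are not the course's car sales dataset). It shows both alternatives the pandas warning suggests: reassigning the column and passing a `{column: value}` mapping to `DataFrame.fillna`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for car_sales_missing
toy_car_sales = pd.DataFrame({"Make": ["Toyota", np.nan, "Honda"],
                              "Colour": ["White", "Blue", np.nan]})

# Option 1: reassign the column with the filled version
toy_car_sales["Make"] = toy_car_sales["Make"].fillna(value="missing")

# Option 2: fill on the DataFrame itself using a {column: value} mapping
toy_car_sales = toy_car_sales.fillna({"Colour": "missing"})

# Count the remaining missing values per column
print(toy_car_sales.isna().sum())
```

Either option works on the original object rather than on an intermediate copy, which is why neither triggers the chained assignment warning.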

1 reply

mrdbourke Sep 23, 2024
Maintainer Author

See the following (this warning is fixed with the above):

mrdbourke · 2024-09-12T05:24:32Z

Heart Disease Classification Notebook

Link - https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-3-structured-data-projects/end-to-end-heart-disease-classification.ipynb

TODO

Make sure notebook works on course book
- Done, see: https://dev.mrdbourke.com/zero-to-mastery-ml/end-to-end-heart-disease-classification/
Make sure images are available via link

Notes

TK - can seaborn be removed for matplotlib? (cleaner)
- Update: Going to keep for now.

Log

24 Sep 2024 - Code/text have been updated to reflect latest API changes + information.
12 Sep 2024 - All code runs, next is to update the text in the notebook to be cleaner + prepare the notebook for use in the course book (e.g. it should run well when on https://dev.mrdbourke.com/zero-to-mastery-ml/)

Changes

df.target.values -> df.target.to_numpy()

pandas.Series.to_numpy() (and the equivalent pandas.DataFrame.to_numpy()) is the recommended method for extracting values as a NumPy array.

Old:

y = df.target.values

New:

y = df.target.to_numpy()
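
A small sketch on made-up data (the toy `df` below is an assumption, not the heart disease dataset) showing that both approaches give the same NumPy array for a simple numeric column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real notebook uses the heart disease DataFrame
df = pd.DataFrame({"age": [63, 37, 41], "target": [1, 1, 0]})

y_old = df.target.values      # older attribute-based access
y_new = df.target.to_numpy()  # explicit method call

print(type(y_new), y_new.dtype)
print(np.array_equal(y_old, y_new))
```

One practical difference is that to_numpy() is a method, so it can take arguments such as dtype and na_value when more control over the output array is needed.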

mrdbourke · 2024-09-12T05:44:02Z

Bulldozer Price Prediction Project (regression)

TODO

Make sure notebook works on course book
Fix pandas API change for detecting types:

- [x] Make sure images are available via link
- [x] Make dataset downloadable from anywhere (e.g. if you're in Google Colab/local it should be easy to get the dataset)
- [x] Add Colab version of the notebook to header

Log

30 Oct 2024 - Code + text cleaned up for 2025 onwards, see above links for updated versions
25 Sep 2024 - Start to clear up code + text for 2025 onwards

Changes

Created v2 of the end to end notebook

Kept original version of the notebook as v1, added an updated version as v2.

pandas datatype API

pd.api.types.is_string_dtype(df_tmp["UsageBand"]) -> pd.api.types.is_object_dtype(df_tmp["UsageBand"])

is_object_dtype returns True for every column with the object dtype, so it catches more columns in the DataFrame than is_string_dtype. An object column in pandas can hold mixed datatypes, including strings.

Old:

for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

New:

for label, content in df_tmp.items():
    if pd.api.types.is_object_dtype(content): # using object dtype check
        print(label)
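
To see the difference between the two checks, here is a hedged sketch on a toy DataFrame (the column values and the `toy_df` name are made up, and the columns are created with `dtype=object` on purpose). What `is_string_dtype` returns for object columns depends on the pandas version, so no expected output is shown for it:

```python
import numpy as np
import pandas as pd

# Hypothetical columns, stored with the object dtype on purpose
toy_df = pd.DataFrame({
    "UsageBand": pd.Series(["Low", "High", np.nan], dtype=object),  # strings + a missing value
    "MixedColumn": pd.Series(["a", 1, 2.5], dtype=object),          # mixed Python types
    "SalePrice": [66000.0, 57000.0, 10000.0],                       # numeric
})

# Compare both checks for each column
for label, content in toy_df.items():
    print(label,
          "| object dtype:", pd.api.types.is_object_dtype(content),
          "| string dtype:", pd.api.types.is_string_dtype(content))

# Collect the columns the object dtype check picks up
object_columns = [label for label, content in toy_df.items()
                  if pd.api.types.is_object_dtype(content)]
print(object_columns)
```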

Converting strings/object columns to categories

Instead of only finding string columns and converting their values to categories, the code now finds all object columns (the object datatype can hold strings or mixed values) and converts their values to the category datatype.

Old:

for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        df_tmp[label] = content.astype("category").cat.as_ordered()

New:

# This will turn all of the string values into category values
for label, content in df_tmp.items():
    if pd.api.types.is_object_dtype(content):
        df_tmp[label] = df_tmp[label].astype("category") # use astype() for type conversion
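
A minimal sketch of what the conversion does, using a made-up `toy_df` (an assumption for illustration, not the bulldozer dataset). After converting, each category column exposes its labels and integer codes through the `.cat` accessor:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with an object-dtype column
toy_df = pd.DataFrame({
    "state": pd.Series(["Alabama", "Texas", "Alabama", np.nan], dtype=object),
    "SalePrice": [66000.0, 57000.0, 10000.0, 38500.0],
})

# Convert every object column to the category datatype
for label, content in toy_df.items():
    if pd.api.types.is_object_dtype(content):
        toy_df[label] = toy_df[label].astype("category")

print(toy_df.dtypes)
print(toy_df["state"].cat.categories)  # unique category labels
print(toy_df["state"].cat.codes)       # integer codes, missing values become -1
```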

Rename saved files to be more descriptive

The processed data file names were poor. Updated them to reflect the original dataset name as well as the actual changes that were made.

Old:

# Save preprocessed data
df_tmp.to_csv("../data/bluebook-for-bulldozers/train_tmp.csv",
              index=False)

New:

# Save preprocessed data
df_tmp.to_csv("../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories.csv", # includes original dataset name (TrainAndValid)
              index=False)

Updated `show_scores` helper function with native `root_mean_squared_log_error`

As of Scikit-Learn 1.4, a native implementation of Root Mean Squared Log Error is available via sklearn.metrics.root_mean_squared_log_error.

The show_scores evaluation function has been updated to use this metric.

It also has four new parameters: train_features and valid_features for accepting training and validation data, and train_labels and valid_labels for accepting training and validation labels.

Old:

# Create evaluation function (the competition uses Root Mean Square Log Error)
from sklearn.metrics import mean_squared_log_error, mean_absolute_error

# Create custom RMSLE function 
def rmsle(y_test, y_preds):
    return np.sqrt(mean_squared_log_error(y_test, y_preds))

# Create function to evaluate our model
def show_scores(model):

    train_preds = model.predict(X_train)
    val_preds = model.predict(X_valid)

    # Create a scores dictionary of different values
    scores = {"Training MAE": mean_absolute_error(y_train, train_preds),
              "Valid MAE": mean_absolute_error(y_valid, val_preds),
              "Training RMSLE": rmsle(y_train, train_preds),
              "Valid RMSLE": rmsle(y_valid, val_preds),
              "Training R^2": model.score(X_train, y_train),
              "Valid R^2": model.score(X_valid, y_valid)}
    return scores

New:

# Create evaluation function (the competition uses Root Mean Square Log Error)
from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error

# Create function to evaluate our model
def show_scores(model, 
                train_features=X_train_preprocessed,
                train_labels=y_train,
                valid_features=X_valid_preprocessed,
                valid_labels=y_valid):
    
    # Make predictions on train and validation features
    train_preds = model.predict(train_features)
    val_preds = model.predict(valid_features)

    # Create a scores dictionary of different evaluation metrics
    scores = {"Training MAE": mean_absolute_error(train_labels, train_preds),
              "Valid MAE": mean_absolute_error(valid_labels, val_preds),
              "Training RMSLE": root_mean_squared_log_error(train_labels, train_preds),
              "Valid RMSLE": root_mean_squared_log_error(valid_labels, val_preds),
              "Training R^2": model.score(train_features, train_labels),
              "Valid R^2": model.score(valid_features, valid_labels)}
    return scores
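
As a quick sanity check, the native metric should match the old custom `rmsle` calculation. The sketch below uses made-up labels and predictions, and the `try`/`except` fallback for older Scikit-Learn versions is an addition for illustration rather than something from the notebook:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

try:
    # Available in Scikit-Learn 1.4+
    from sklearn.metrics import root_mean_squared_log_error
except ImportError:
    # Fallback for older versions: same calculation as the old custom rmsle()
    def root_mean_squared_log_error(y_true, y_pred):
        return np.sqrt(mean_squared_log_error(y_true, y_pred))

# Hypothetical labels and predictions
y_true = np.array([10000.0, 25000.0, 40000.0])
y_pred = np.array([12000.0, 24000.0, 35000.0])

native_rmsle = root_mean_squared_log_error(y_true, y_pred)
manual_rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))

# Both approaches should agree
print(np.isclose(native_rmsle, manual_rmsle))
```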

Set `verbose=3` in `RandomizedSearchCV`

Use an integer for the verbose parameter instead of verbose=True.

See docs for more: https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

Old:

rs_model = RandomizedSearchCV(estimator=RandomForestRegressor(),
                              param_distributions=rf_grid,
                              n_iter=20,
                              cv=5,
                              verbose=True)

New:

rs_model = RandomizedSearchCV(estimator=RandomForestRegressor(),
                              param_distributions=rf_grid,
                              n_iter=20,
                              cv=5,
                              verbose=3) # control how much output gets produced, higher number = more output
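
For a quick way to see the effect of the setting, here is a small runnable sketch. The dataset, the `toy_rf_grid` search space and the reduced `n_iter`/`cv` values are all assumptions chosen so the search finishes in a few seconds, they are not the notebook's settings:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical small dataset so the search finishes quickly
X, y = make_regression(n_samples=60, n_features=4, noise=0.1, random_state=42)

# Hypothetical (tiny) search space, much smaller than the notebook's rf_grid
toy_rf_grid = {"n_estimators": [5, 10],
               "max_depth": [None, 3]}

rs_model = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=42),
                              param_distributions=toy_rf_grid,
                              n_iter=2,
                              cv=2,
                              random_state=42,
                              verbose=3) # integer verbosity level, higher = more output per fit
rs_model.fit(X, y)

print(rs_model.best_params_)
```

Changing verbose to 0 on the same search silences the per-fit messages, which can be handy once the search is known to work.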

Fixed data leakage issue for modelling/scoring on Train/Validation set

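Previously, missing values were filled using TrainAndValid.csv as a whole, so information from the validation rows leaked into the training data. Now the preprocessing uses fit_transform on Train.csv and only transform on Valid.csv.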

See section 3. Splitting data into the right train/validation sets of notebook v2.
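
The fit-on-train, transform-on-valid pattern can be sketched as follows. This is a minimal illustration using `SimpleImputer` on made-up data (the DataFrames and values are assumptions, and the notebook's actual preprocessing steps may differ):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-ins for Train.csv and Valid.csv
train_df = pd.DataFrame({"MachineHoursCurrentMeter": [100.0, np.nan, 300.0, 500.0]})
valid_df = pd.DataFrame({"MachineHoursCurrentMeter": [np.nan, 10000.0]})

imputer = SimpleImputer(strategy="median")

# Learn the fill value from the training data only, then fill the training data
train_filled = imputer.fit_transform(train_df)

# Reuse the training median on the validation data (no fitting on validation data)
valid_filled = imputer.transform(valid_df)

print(imputer.statistics_)  # median learned from the training split
print(valid_filled)
```

The missing validation value is filled with the training median rather than a value influenced by the validation rows, which is the point of the leakage fix.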

Changed test DataFrame preprocessing function

The previous preprocessing function was based on using TrainAndValid.csv together and preprocessing with pandas (this could run into issues with columns that are not available in the test dataset).

The new preprocessing steps use Train.csv and Valid.csv separately and preprocess with Scikit-Learn: an ordinal encoder is fit on the training data only and then used to encode the valid/test datasets, so only information from the training dataset is used to transform the unseen valid and test datasets.
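
A minimal sketch of the encoder idea on toy data. The `toy_` names, the example values and the `handle_unknown`/`unknown_value` settings are assumptions for illustration, since the encoder configuration used in the notebook is not shown in this post:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-ins for the training and test DataFrames
toy_train_df = pd.DataFrame({"state": ["Alabama", "Texas", "Florida"]})
toy_test_df = pd.DataFrame({"state": ["Texas", "Ohio"]})  # "Ohio" never appears in training
toy_categorical_features = ["state"]

# Assumed settings: unseen categories get encoded as -1 rather than raising an error
toy_ordinal_encoder = OrdinalEncoder(handle_unknown="use_encoded_value",
                                     unknown_value=-1)

# Fit on the training data only
toy_ordinal_encoder.fit(toy_train_df[toy_categorical_features].astype(str))

# Transform the unseen data with the already-fitted encoder
toy_test_df[toy_categorical_features] = toy_ordinal_encoder.transform(
    toy_test_df[toy_categorical_features].astype(str))

print(toy_test_df)
```

Because the encoder never sees the test rows during fitting, a category that only exists in the test data cannot influence the mapping.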

Old:

import pandas as pd

test_df = pd.read_csv("../data/bluebook-for-bulldozers/Test.csv",
                      parse_dates=["saledate"])

# Use pandas for data preprocessing
# Note: This uses information from the test dataset to preprocess itself. This is not best practice.
# Best to use information from the training dataset (seen data) to preprocess the test data (unseen data). 
# See new code below for using an encoder trained on the training dataset to transform the test data.
def preprocess_data(df):
    # Add datetime parameters for saledate
    df["saleYear"] = df.saledate.dt.year
    df["saleMonth"] = df.saledate.dt.month
    df["saleDay"] = df.saledate.dt.day
    df["saleDayofweek"] = df.saledate.dt.dayofweek
    df["saleDayofyear"] = df.saledate.dt.dayofyear

    # Drop original saledate
    df.drop("saledate", axis=1, inplace=True)
    
    # Fill numeric rows with the median
    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                df[label+"_is_missing"] = pd.isnull(content)
                df[label] = content.fillna(content.median())
                
        # Turn categorical variables into numbers
        if not pd.api.types.is_numeric_dtype(content):
            df[label+"_is_missing"] = pd.isnull(content)
            # We add the +1 because pandas encodes missing categories as -1
            df[label] = pd.Categorical(content).codes+1        
    
    return df

df_test = preprocess_data(test_df)
df_test.head()

New:

import pandas as pd

test_df = pd.read_csv("../data/bluebook-for-bulldozers/Test.csv",
                      parse_dates=["saledate"])

# Make a function to add date columns
def add_datetime_features_to_df(df, date_column="saledate"):
    # Add datetime parameters for saledate
    df["saleYear"] = df[date_column].dt.year
    df["saleMonth"] = df[date_column].dt.month
    df["saleDay"] = df[date_column].dt.day
    df["saleDayofweek"] = df[date_column].dt.dayofweek
    df["saleDayofyear"] = df[date_column].dt.dayofyear

    # Drop original date column (use the date_column parameter rather than a hard-coded name)
    df.drop(date_column, axis=1, inplace=True)

    return df

# Preprocess test_df to have same columns as train_df (add the datetime features)
test_df = add_datetime_features_to_df(df=test_df)

# Use the ordinal encoder fit on the training dataset to turn the categorical features of test_df into numerical features
test_df[categorical_features] = ordinal_encoder.transform(test_df[categorical_features].astype(str))

# Make predictions with the fitted model (this should work now)
test_preds = best_model.predict(test_df)
test_preds.shape

>>> (12457,)

Replace `seaborn` with `matplotlib`

One less library to require.

matplotlib can handle the plots we need :D

Old:

import seaborn as sns

# Helper function for plotting feature importance
def plot_features(columns, importances, n=20):
    df = (pd.DataFrame({"features": columns,
                        "feature_importance": importances})
          .sort_values("feature_importance", ascending=False)
          .reset_index(drop=True))
    
    sns.barplot(x="feature_importance",
                y="features",
                data=df[:n],
                orient="h")

plot_features(X_train.columns, best_model.feature_importances_)

New:

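import matplotlib.pyplot as plt

# Create feature importance DataFrame
column_names = test_df.columns
best_model_feature_importances = best_model.feature_importances_ # assumed definition, same attribute the old plot_features call used
feature_importance_df = (pd.DataFrame({"feature_names": column_names,
                                       "feature_importance": best_model_feature_importances})
                         .sort_values(by="feature_importance", ascending=False))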

# Plot the top feature importance values
top_n = 20
plt.figure(figsize=(10, 5))
plt.barh(y=feature_importance_df["feature_names"][:top_n], # Plot the top_n feature importance values
         width=feature_importance_df["feature_importance"][:top_n])
plt.title(f"Top {top_n} Feature Importance Values for Best RandomForestRegressor Model")
plt.xlabel("Feature importance value")
plt.ylabel("Feature name")
plt.gca().invert_yaxis();

New output:

New: Add checkpointing in `parquet` format

To preserve datatypes, can use the parquet format.

This requires installing pyarrow or fastparquet (either one is enough).

pip install pyarrow fastparquet

See more: https://pandas.pydata.org/docs/user_guide/io.html#parquet

Code example:

# To save to parquet format requires pyarrow or fastparquet (or both)
# Can install via `pip install pyarrow fastparquet`
df_tmp.to_parquet(path="TrainAndValid_object_values_as_categories.parquet", 
                  engine="auto")

# Read in df_tmp from parquet format
df_tmp = pd.read_parquet("TrainAndValid_object_values_as_categories.parquet")

# Using parquet format, datatypes are preserved
df_tmp.info()
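
To illustrate why the format matters, the sketch below round-trips a made-up DataFrame through both CSV and parquet and compares the resulting dtypes. The `toy_df` data and file names are assumptions, and the `try`/`except` guard is only there so the example still runs when neither parquet engine is installed:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy DataFrame with a category column
toy_df = pd.DataFrame({"state": pd.Series(["Alabama", "Texas", "Alabama"], dtype="category"),
                       "SalePrice": [66000.0, 57000.0, 10000.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # CSV stores plain text, so the category dtype is lost on reload
    csv_path = os.path.join(tmp_dir, "toy.csv")
    toy_df.to_csv(csv_path, index=False)
    csv_dtype = pd.read_csv(csv_path)["state"].dtype

    # Parquet stores the schema alongside the data
    try:
        parquet_path = os.path.join(tmp_dir, "toy.parquet")
        toy_df.to_parquet(parquet_path, engine="auto")
        parquet_dtype = pd.read_parquet(parquet_path)["state"].dtype
    except ImportError:
        parquet_dtype = None  # neither pyarrow nor fastparquet is installed

print("CSV round trip:", csv_dtype)
print("Parquet round trip:", parquet_dtype)
```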

Misc

df_test renamed to test_df
df_preds renamed to pred_df

mohsin411 · 2024-10-04T18:11:58Z

mohsin411
Oct 4, 2024

I will wait for new upcoming course many thanks friend for providing awesome courses.....I really appreciate it.
