diff --git a/data/.gitignore b/data/.gitignore
index 1be359e..01b7e33 100644
--- a/data/.gitignore
+++ b/data/.gitignore
@@ -1,2 +1 @@
-# Ignore everything in this directory
 **/*
\ No newline at end of file
diff --git a/src/datasets.py b/src/datasets.py
index dec7ab7..27eedd2 100644
--- a/src/datasets.py
+++ b/src/datasets.py
@@ -1,12 +1,12 @@
 """
-Dataset Module for NFL Big Data Bowl 2024
+Dataset Module for NFL Big Data Bowl 2026
 
-This module defines the BDB2024_Dataset class, which is used to load and preprocess
-data for training machine learning models. It includes functionality for both
-'transformer' and 'zoo' model types.
+This module defines the BDB2026_Dataset class, which is used to load and preprocess
+data for training machine learning models to predict ball landing locations.
+It includes functionality for the 'transformer' model type, using position and kinematic features only.
 
 Classes:
-    BDB2024_Dataset: Custom dataset class for NFL tracking data
+    BDB2026_Dataset: Custom dataset class for NFL tracking data
 
 Functions:
     load_datasets: Load preprocessed datasets for a specific model type and data split
@@ -29,15 +29,15 @@
 DATASET_DIR = Path("data/datasets/")
 
 
-class BDB2024_Dataset(Dataset):
+class BDB2026_Dataset(Dataset):
     """
-    Custom dataset class for NFL tracking data.
+    Custom dataset class for NFL tracking data, used to predict ball landing locations.
 
     This class preprocesses and stores NFL tracking data for use in machine learning models.
-    It supports both 'transformer' and 'zoo' model types.
+    Uses only position and kinematic features.
 
     Attributes:
-        model_type (str): Type of model ('transformer' or 'zoo')
+        model_type (str): Type of model ('transformer')
         keys (list): List of unique identifiers for each data point
         feature_df_partition (pd.DataFrame): Preprocessed feature data
         tgt_df_partition (pd.DataFrame): Preprocessed target data
@@ -55,28 +55,28 @@ def __init__(
         Initialize the dataset.
 
         Args:
-            model_type (str): Type of model ('transformer' or 'zoo')
+            model_type (str): Type of model (currently only 'transformer' supported)
            feature_df (pl.DataFrame): DataFrame containing feature data
            tgt_df (pl.DataFrame): DataFrame containing target data
 
         Raises:
            ValueError: If an invalid model_type is provided
        """
-        if model_type not in ["transformer", "zoo"]:
-            raise ValueError("model_type must be either 'transformer' or 'zoo'")
+        if model_type not in ["transformer"]:
+            raise ValueError("model_type must be 'transformer'")
 
         self.model_type = model_type
-        self.keys = list(feature_df.select(["gameId", "playId", "mirrored", "frameId"]).unique().rows())
+        self.keys = list(feature_df.select(["game_id", "play_id", "frame_id"]).unique().rows())
 
         # Convert to pandas form with index for quick row retrieval
         self.feature_df_partition = (
             feature_df.to_pandas(use_pyarrow_extension_array=True)
-            .set_index(["gameId", "playId", "mirrored", "frameId", "nflId"])
+            .set_index(["game_id", "play_id", "frame_id", "nfl_id"])
             .sort_index()
         )
         self.tgt_df_partition = (
             tgt_df.to_pandas(use_pyarrow_extension_array=True)
-            .set_index(["gameId", "playId", "mirrored", "frameId"])
+            .set_index(["game_id", "play_id", "frame_id"])
             .sort_index()
         )
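# --- Reviewer note (illustrative sketch, not part of the patch) ---------------
# Why the pandas MultiIndex above: one .loc call with a (game_id, play_id,
# frame_id) key from self.keys pulls every player row for that frame, which is
# what __getitem__ needs. The toy frame below is made-up data.
import pandas as pd

toy = pd.DataFrame(
    {
        "game_id": [1, 1], "play_id": [10, 10], "frame_id": [5, 5],
        "nfl_id": [101, 102], "x": [20.0, 24.5], "y": [26.7, 30.1],
    }
).set_index(["game_id", "play_id", "frame_id", "nfl_id"]).sort_index()

frame_df = toy.loc[(1, 10, 5)]  # all players tracked in frame 5 of play (1, 10)
assert list(frame_df.index) == [101, 102]
# --- End reviewer note ---------------------------------------------------------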
@@ -136,7 +136,7 @@ def __getitem__(self, idx: int) -> tuple[np.ndarray, np.ndarray]:
 
     def transform_input_frame_df(self, frame_df: pd.DataFrame) -> np.ndarray:
         """
-        Transform input frame DataFrame to numpy array based on model type.
+        Transform input frame DataFrame to numpy array using position and kinematic features only.
 
         Args:
             frame_df (pd.DataFrame): Input frame DataFrame
@@ -149,14 +149,12 @@ def transform_input_frame_df(self, frame_df: pd.DataFrame) -> np.ndarray:
         """
         if self.model_type == "transformer":
             return self.transformer_transform_input_frame_df(frame_df)
-        elif self.model_type == "zoo":
-            return self.zoo_transform_input_frame_df(frame_df)
         else:
             raise ValueError(f"Unknown model type: {self.model_type}")
 
     def transform_target_df(self, tgt_df: pd.DataFrame) -> np.ndarray:
         """
-        Transform target DataFrame to numpy array.
+        Transform target DataFrame to numpy array (ball_land_x, ball_land_y).
 
         Args:
             tgt_df (pd.DataFrame): Target DataFrame
 
@@ -167,90 +165,59 @@
         Raises:
             AssertionError: If the output shape is not as expected
         """
-        y = tgt_df[["tackle_x_rel", "tackle_y_rel"]].to_numpy(dtype=np.float32).squeeze()
+        y = tgt_df[["ball_land_x", "ball_land_y"]].to_numpy(dtype=np.float32).squeeze()
         assert y.shape == (2,), f"Expected shape (2,), got {y.shape}"
         return y
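# --- Reviewer note (illustrative sketch, not part of the patch) ---------------
# transform_target_df expects exactly one target row per frame: a (1, 2) slice
# squeezes to (2,). If a frame ever carried duplicate target rows, squeeze would
# leave the shape at (n, 2) and the assert would fire - that is the point of
# the guard.
import numpy as np

one_row = np.array([[61.3, 24.9]], dtype=np.float32)  # made-up (ball_land_x, ball_land_y)
assert one_row.squeeze().shape == (2,)
# --- End reviewer note ---------------------------------------------------------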
     def transformer_transform_input_frame_df(self, frame_df: pd.DataFrame) -> np.ndarray:
         """
-        Transform input frame DataFrame for transformer model.
+        Transform input frame DataFrame for transformer model using position and kinematic features.
+        Pads or truncates to a fixed number of players (22) for batching.
 
-        Args:
-            frame_df (pd.DataFrame): Input frame DataFrame
-
-        Returns:
-            np.ndarray: Transformed input features for transformer model
-
-        Raises:
-            AssertionError: If the output shape is not as expected
-        """
-        features = ["x_rel", "y_rel", "vx", "vy", "side", "is_ball_carrier"]
-        x = frame_df[features].to_numpy(dtype=np.float32)
-        assert x.shape == (22, len(features)), f"Expected shape (22, {len(features)}), got {x.shape}"
-        return x
-
-    def zoo_transform_input_frame_df(self, frame_df: pd.DataFrame) -> np.ndarray:
-        """
-        Transform input frame DataFrame for zoo model.
+        Features used:
+        - x, y: position
+        - s: speed
+        - a: acceleration
+        - vx, vy: velocity components
+        - ox, oy: orientation components
 
         Args:
             frame_df (pd.DataFrame): Input frame DataFrame
 
         Returns:
-            np.ndarray: Transformed input features for zoo model
+            np.ndarray: Transformed input features for transformer model (shape: [22, 8])
 
         Raises:
             AssertionError: If the output shape is not as expected
         """
-        # Isolate offensive and defensive players
-        ball_carrier = frame_df[frame_df["is_ball_carrier"] == 1]
-        off_plyrs = frame_df[(frame_df["side"] == 1) & (frame_df["is_ball_carrier"] == 0)]
-        def_plyrs = frame_df[frame_df["side"] == -1]
-
-        ball_carr_mvmt_feats = ball_carrier[["x_rel", "y_rel", "vx", "vy"]].to_numpy(dtype=np.float32).squeeze()
-        off_mvmt_feats = off_plyrs[["x_rel", "y_rel", "vx", "vy"]].to_numpy(dtype=np.float32)
-        def_mvmt_feats = def_plyrs[["x_rel", "y_rel", "vx", "vy"]].to_numpy(dtype=np.float32)
-
-        # Zoo interaction features
-        x = [
-            # def_vx, def_vy
-            np.tile(def_mvmt_feats[:, 2:], (10, 1, 1)),
-            # def_x - ball_x, def_y - ball_y
-            np.tile(
-                def_mvmt_feats[None, :, :2] - ball_carr_mvmt_feats[None, None, :2],
-                (10, 1, 1),
-            ),
-            # def_vx - ball_vx, def_vy - ball_vy
-            np.tile(
-                def_mvmt_feats[None, :, 2:] - ball_carr_mvmt_feats[None, None, 2:],
-                (10, 1, 1),
-            ),
-            # off_x - def_x, off_y - def_y
-            off_mvmt_feats[:, None, :2] - def_mvmt_feats[None, :, :2],
-            # off_vx - def_vx, off_vy - def_vy
-            off_mvmt_feats[:, None, 2:] - def_mvmt_feats[None, :, 2:],
-        ]
-
-        x = np.concatenate(
-            x,
-            dtype=np.float32,
-            axis=-1,
-        )
-
-        assert x.shape == (10, 11, 10), f"Expected shape (10, 11, 10), got {x.shape}"
+        features = ["x", "y", "s", "a", "vx", "vy", "ox", "oy"]
+        x = frame_df[features].to_numpy(dtype=np.float32)
+        num_players = x.shape[0]
+
+        # Pad or truncate to fixed size (22 players)
+        max_players = 22
+        if num_players < max_players:
+            # Pad with zeros
+            padding = np.zeros((max_players - num_players, len(features)), dtype=np.float32)
+            x = np.vstack([x, padding])
+        elif num_players > max_players:
+            # Truncate (shouldn't happen with this data, but handle it)
+            x = x[:max_players]
+
+        assert x.shape == (max_players, len(features)), f"Expected shape ({max_players}, {len(features)}), got {x.shape}"
         return x
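# --- Reviewer note (illustrative sketch, not part of the patch) ---------------
# The zero-padding keeps every sample at a fixed (22, 8) shape so frames with
# fewer tracked players still batch cleanly. Note the all-zero rows are fed to
# the transformer as if they were real players; adding a padding mask would be
# the usual refinement if those rows turn out to distort attention.
import numpy as np

x = np.random.randn(17, 8).astype(np.float32)           # e.g. a frame with 17 players
pad = np.zeros((22 - x.shape[0], 8), dtype=np.float32)  # 5 rows of zeros
x = np.vstack([x, pad])
assert x.shape == (22, 8)
# --- End reviewer note ---------------------------------------------------------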
""" for split in ["test", "val", "train"]: feature_df = pl.read_parquet(PREPPED_DATA_DIR / f"{split}_features.parquet") tgt_df = pl.read_parquet(PREPPED_DATA_DIR / f"{split}_targets.parquet") - for model_type in ["zoo", "transformer"]: - print(f"Creating dataset for {model_type=}, {split=}...") - tic = time.time() - dataset = BDB2024_Dataset(model_type, feature_df, tgt_df) - out_dir = DATASET_DIR / model_type - out_dir.mkdir(exist_ok=True, parents=True) - with open(out_dir / f"{split}_dataset.pkl", "wb") as f: - pickle.dump(dataset, f) - print(f"Took {(time.time() - tic)/60:.1f} mins") + model_type = "transformer" + print(f"Creating dataset for {model_type=}, {split=}...") + tic = time.time() + dataset = BDB2026_Dataset(model_type, feature_df, tgt_df) + out_dir = DATASET_DIR / model_type + out_dir.mkdir(exist_ok=True, parents=True) + with open(out_dir / f"{split}_dataset.pkl", "wb") as f: + pickle.dump(dataset, f) + print(f"Took {(time.time() - tic)/60:.1f} mins") if __name__ == "__main__": diff --git a/src/models.py b/src/models.py index 123d787..ababa06 100644 --- a/src/models.py +++ b/src/models.py @@ -285,16 +285,14 @@ def __init__( """ super().__init__() self.model_type = model_type.lower() - self.model_class = SportsTransformer if self.model_type == "transformer" else TheZooArchitecture - self.feature_len = 6 if self.model_type == "transformer" else 10 + self.model_class = SportsTransformer + # 8 features: x, y, s, a, vx, vy, ox, oy + self.feature_len = 8 self.model = self.model_class( feature_len=self.feature_len, model_dim=model_dim, num_layers=num_layers, dropout=dropout ) - self.example_input_array = ( - torch.randn((batch_size, 22, self.feature_len)) - if self.model_type == "transformer" - else torch.randn((batch_size, 10, 11, self.feature_len)) - ) + # Variable number of players per frame, use 22 as example + self.example_input_array = torch.randn((batch_size, 22, self.feature_len)) self.learning_rate = learning_rate self.num_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad) diff --git a/src/prep_data.py b/src/prep_data.py index 6d6580d..cc0a60f 100644 --- a/src/prep_data.py +++ b/src/prep_data.py @@ -1,20 +1,15 @@ """ -Data Preparation Module for NFL Big Data Bowl 2024 +Data Preparation Module for NFL Big Data Bowl 2026 This module processes raw NFL tracking data to prepare it for machine learning models. It includes functions for loading, cleaning, and transforming the data, as well as splitting it into train, validation, and test sets. 
diff --git a/src/prep_data.py b/src/prep_data.py
index 6d6580d..cc0a60f 100644
--- a/src/prep_data.py
+++ b/src/prep_data.py
@@ -1,20 +1,15 @@
 """
-Data Preparation Module for NFL Big Data Bowl 2024
+Data Preparation Module for NFL Big Data Bowl 2026
 
 This module processes raw NFL tracking data to prepare it for machine learning models.
 It includes functions for loading, cleaning, and transforming the data, as well as
 splitting it into train, validation, and test sets.
 
 Functions:
-    get_players_df: Load and preprocess player data
-    get_plays_df: Load and preprocess play data
-    get_tracking_df: Load and preprocess tracking data
-    add_features_to_tracking_df: Add derived features to tracking data
+    load_input_data: Load input tracking data from CSV files
     convert_tracking_to_cartesian: Convert polar coordinates to Cartesian
     standardize_tracking_directions: Standardize play directions
-    augment_mirror_tracking: Augment data by mirroring the field
-    add_relative_positions: Add relative position features
-    get_tackle_loc_target_df: Generate target dataframe for tackle location prediction
+    prepare_tracking_data: Prepare tracking data with position and kinematic features
     split_train_test_val: Split data into train, validation, and test sets
     main: Main execution function
 
@@ -25,114 +20,28 @@
 import polars as pl
 
-INPUT_DATA_DIR = Path("data/bdb_2024/")
+INPUT_DATA_DIR = Path("data/train/")
 
 
-def get_players_df() -> pl.DataFrame:
+def load_input_data() -> pl.DataFrame:
     """
-    Load player-level data and preprocesses features.
+    Load input tracking data from CSV files.
 
     Returns:
-        pl.DataFrame: Preprocessed player data with additional features.
+        pl.DataFrame: Raw tracking data with all fields from the input files.
     """
-    return (
-        pl.read_csv(INPUT_DATA_DIR / "players.csv", null_values=["NA", "nan", "N/A", "NaN", ""])
-        .with_columns(
-            height_inches=(
-                pl.col("height").str.split("-").map_elements(lambda s: int(s[0]) * 12 + int(s[1]), return_dtype=int)
-            )
-        )
-        .with_columns(
-            weight_Z=(pl.col("weight") - pl.col("weight").mean()) / pl.col("weight").std(),
-            height_Z=(pl.col("height_inches") - pl.col("height_inches").mean()) / pl.col("height_inches").std(),
-        )
-    )
-
-
-def get_plays_df() -> pl.DataFrame:
-    """
-    Load play-level data and preprocesses features.
-
-    Returns:
-        pl.DataFrame: Preprocessed play data with additional features.
-    """
-    return pl.read_csv(INPUT_DATA_DIR / "plays.csv", null_values=["NA", "nan", "N/A", "NaN", ""]).with_columns(
-        distanceToGoal=(
-            pl.when(pl.col("possessionTeam") == pl.col("yardlineSide"))
-            .then(100 - pl.col("yardlineNumber"))
-            .otherwise(pl.col("yardlineNumber"))
-        )
-    )
-
-
-def get_tracking_df() -> pl.DataFrame:
-    """
-    Load tracking data and preprocesses features. Notably, exclude rows representing the football's movement.
-
-    Returns:
-        pl.DataFrame: Preprocessed tracking data with additional features.
-    """
-    # don't include football rows for this project
-    return pl.read_csv(INPUT_DATA_DIR / "tracking_week_*.csv", null_values=["NA", "nan", "N/A", "NaN", ""]).filter(
-        pl.col("displayName") != "football"
-    )
-
-
-def add_features_to_tracking_df(
-    tracking_df: pl.DataFrame,
-    players_df: pl.DataFrame,
-    plays_df: pl.DataFrame,
-) -> pl.DataFrame:
-    """
-    Consolidates play and player level data into the tracking data.
-
-    Args:
-        tracking_df (pl.DataFrame): Tracking data
-        players_df (pl.DataFrame): Player data
-        plays_df (pl.DataFrame): Play data
-
-    Returns:
-        pl.DataFrame: Tracking data with additional features.
-    """
-    # add `is_ball_carrier`, `team_indicator`, and other features to tracking data
-    og_len = len(tracking_df)
-    tracking_df = (
-        tracking_df.join(
-            plays_df.select(
-                "gameId",
-                "playId",
-                "ballCarrierId",
-                "possessionTeam",
-                "down",
-                "yardsToGo",
-                "distanceToGoal",
-                "playResult",
-            ),
-            on=["gameId", "playId"],
-            how="inner",
-        )
-        .join(
-            players_df.select(["nflId", "weight_Z", "height_Z"]).unique(),
-            on="nflId",
-            how="inner",
-        )
-        .with_columns(
-            is_ball_carrier=(pl.col("nflId") == pl.col("ballCarrierId")).cast(int),
-            side=pl.when(pl.col("club") == pl.col("possessionTeam"))
-            .then(pl.lit(1))
-            .otherwise(pl.lit(-1))
-            .alias("side"),
-        )
-        .drop(["ballCarrierId", "possessionTeam"])
-    )
-    assert len(tracking_df) == og_len, "Lost rows when joining tracking data with play/player data"
-
-    return tracking_df
+    # Read all input CSV files from the train directory
+    df = pl.read_csv(INPUT_DATA_DIR / "input_*.csv", null_values=["NA", "nan", "N/A", "NaN", ""])
+    print(f"Loaded {len(df)} rows from input files")
+    print(f"Unique plays: {df.n_unique(['game_id', 'play_id'])}")
+    return df
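# --- Reviewer note (illustrative sketch, not part of the patch) ---------------
# Everything downstream assumes these columns exist in the input_*.csv files; a
# cheap schema check right after load_input_data() fails fast if the export ever
# changes. The helper name is hypothetical.
REQUIRED_COLS = {
    "game_id", "play_id", "nfl_id", "frame_id", "play_direction",
    "x", "y", "s", "a", "dir", "o", "player_side", "ball_land_x", "ball_land_y",
}

def check_input_schema(df) -> None:
    missing = REQUIRED_COLS - set(df.columns)
    assert not missing, f"input files missing columns: {sorted(missing)}"
# --- End reviewer note ---------------------------------------------------------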
- """ - # add `is_ball_carrier`, `team_indicator`, and other features to tracking data - og_len = len(tracking_df) - tracking_df = ( - tracking_df.join( - plays_df.select( - "gameId", - "playId", - "ballCarrierId", - "possessionTeam", - "down", - "yardsToGo", - "distanceToGoal", - "playResult", - ), - on=["gameId", "playId"], - how="inner", - ) - .join( - players_df.select(["nflId", "weight_Z", "height_Z"]).unique(), - on="nflId", - how="inner", - ) - .with_columns( - is_ball_carrier=(pl.col("nflId") == pl.col("ballCarrierId")).cast(int), - side=pl.when(pl.col("club") == pl.col("possessionTeam")) - .then(pl.lit(1)) - .otherwise(pl.lit(-1)) - .alias("side"), - ) - .drop(["ballCarrierId", "possessionTeam"]) - ) - assert len(tracking_df) == og_len, "Lost rows when joining tracking data with play/player data" - - return tracking_df + # Read all input CSV files from the train directory + df = pl.read_csv(INPUT_DATA_DIR / "input_*.csv", null_values=["NA", "nan", "N/A", "NaN", ""]) + print(f"Loaded {len(df)} rows from input files") + print(f"Unique plays: {df.n_unique(['game_id', 'play_id'])}") + return df def convert_tracking_to_cartesian(tracking_df: pl.DataFrame) -> pl.DataFrame: """ Convert polar coordinates to Unit-circle Cartesian format. + We keep the original position (x, y) and kinematic variables (s, a, dir, o), + and also compute cartesian velocity components (vx, vy) and orientation (ox, oy). Args: tracking_df (pl.DataFrame): Tracking data @@ -142,22 +51,25 @@ def convert_tracking_to_cartesian(tracking_df: pl.DataFrame) -> pl.DataFrame: """ return ( tracking_df.with_columns( - dir=((pl.col("dir") - 90) * -1) % 360, - o=((pl.col("o") - 90) * -1) % 360, + # Adjust dir and o to match unit circle convention + dir_adjusted=((pl.col("dir") - 90) * -1) % 360, + o_adjusted=((pl.col("o") - 90) * -1) % 360, ) # convert polar vectors to cartesian ((s, dir) -> (vx, vy), (o) -> (ox, oy)) .with_columns( - vx=pl.col("s") * pl.col("dir").radians().cos(), - vy=pl.col("s") * pl.col("dir").radians().sin(), - ox=pl.col("o").radians().cos(), - oy=pl.col("o").radians().sin(), + vx=pl.col("s") * pl.col("dir_adjusted").radians().cos(), + vy=pl.col("s") * pl.col("dir_adjusted").radians().sin(), + ox=pl.col("o_adjusted").radians().cos(), + oy=pl.col("o_adjusted").radians().sin(), ) + .drop(["dir_adjusted", "o_adjusted"]) ) def standardize_tracking_directions(tracking_df: pl.DataFrame) -> pl.DataFrame: """ Standardize play directions to always moving left to right. + Also standardize ball_land_x and ball_land_y targets. Args: tracking_df (pl.DataFrame): Tracking data @@ -166,236 +78,169 @@ def standardize_tracking_directions(tracking_df: pl.DataFrame) -> pl.DataFrame: pl.DataFrame: Tracking data with standardized directions. 
""" return tracking_df.with_columns( - x=pl.when(pl.col("playDirection") == "right").then(pl.col("x")).otherwise(120 - pl.col("x")), - y=pl.when(pl.col("playDirection") == "right").then(pl.col("y")).otherwise(53.3 - pl.col("y")), - vx=pl.when(pl.col("playDirection") == "right").then(pl.col("vx")).otherwise(-1 * pl.col("vx")), - vy=pl.when(pl.col("playDirection") == "right").then(pl.col("vy")).otherwise(-1 * pl.col("vy")), - ox=pl.when(pl.col("playDirection") == "right").then(pl.col("ox")).otherwise(-1 * pl.col("ox")), - oy=pl.when(pl.col("playDirection") == "right").then(pl.col("oy")).otherwise(-1 * pl.col("oy")), - ).drop("playDirection") + x=pl.when(pl.col("play_direction") == "right").then(pl.col("x")).otherwise(120 - pl.col("x")), + y=pl.when(pl.col("play_direction") == "right").then(pl.col("y")).otherwise(53.3 - pl.col("y")), + vx=pl.when(pl.col("play_direction") == "right").then(pl.col("vx")).otherwise(-1 * pl.col("vx")), + vy=pl.when(pl.col("play_direction") == "right").then(pl.col("vy")).otherwise(-1 * pl.col("vy")), + ox=pl.when(pl.col("play_direction") == "right").then(pl.col("ox")).otherwise(-1 * pl.col("ox")), + oy=pl.when(pl.col("play_direction") == "right").then(pl.col("oy")).otherwise(-1 * pl.col("oy")), + # Also standardize the target variables + ball_land_x=pl.when(pl.col("play_direction") == "right").then(pl.col("ball_land_x")).otherwise(120 - pl.col("ball_land_x")), + ball_land_y=pl.when(pl.col("play_direction") == "right").then(pl.col("ball_land_y")).otherwise(53.3 - pl.col("ball_land_y")), + ).drop("play_direction") -def augment_mirror_tracking(tracking_df: pl.DataFrame) -> pl.DataFrame: +def prepare_tracking_data(tracking_df: pl.DataFrame) -> tuple[pl.DataFrame, pl.DataFrame]: """ - Augment data by mirroring the field assuming all plays are moving right. - There are arguments to not do this as football isn't perfectly symmetric (e.g. most QBs are right-handed) but - tackling is mostly symmetrical and for the sake of this demo I think more data is more important. + Prepare tracking data with position and kinematic features, and extract targets. Args: tracking_df (pl.DataFrame): Tracking data Returns: - pl.DataFrame: Augmented tracking data. + tuple: (features_df, targets_df) where features_df contains position/kinematic features + and targets_df contains ball_land_x and ball_land_y per frame """ - og_len = len(tracking_df) - - mirrored_tracking_df = tracking_df.clone().with_columns( - # only flip y values - y=53.3 - pl.col("y"), - vy=-1 * pl.col("vy"), - oy=-1 * pl.col("oy"), - mirrored=pl.lit(True), + # Filter out rows where ball_land_x or ball_land_y are null + tracking_df = tracking_df.filter( + pl.col("ball_land_x").is_not_null() & pl.col("ball_land_y").is_not_null() ) - tracking_df = pl.concat( - [ - tracking_df.with_columns(mirrored=pl.lit(False)), - mirrored_tracking_df, - ], - how="vertical", + print(f"After filtering nulls: {len(tracking_df)} rows") + print(f"Unique frames: {tracking_df.n_unique(['game_id', 'play_id', 'frame_id'])}") + + # Create target dataframe (one row per frame with ball landing location) + targets_df = ( + tracking_df.select([ + "game_id", + "play_id", + "frame_id", + "ball_land_x", + "ball_land_y", + ]) + .unique() ) - assert len(tracking_df) == og_len * 2, "Lost rows when mirroring tracking data" - return tracking_df - - -def get_tackle_loc_target_df(tracking_df: pl.DataFrame) -> pl.DataFrame: - """ - Generate target dataframe for tackle location prediction. 
 
 
-def augment_mirror_tracking(tracking_df: pl.DataFrame) -> pl.DataFrame:
+def prepare_tracking_data(tracking_df: pl.DataFrame) -> tuple[pl.DataFrame, pl.DataFrame]:
     """
-    Augment data by mirroring the field assuming all plays are moving right.
-    There are arguments to not do this as football isn't perfectly symmetric (e.g. most QBs are right-handed) but
-    tackling is mostly symmetrical and for the sake of this demo I think more data is more important.
+    Prepare tracking data with position and kinematic features, and extract targets.
 
     Args:
         tracking_df (pl.DataFrame): Tracking data
 
     Returns:
-        pl.DataFrame: Augmented tracking data.
+        tuple: (features_df, targets_df) where features_df contains position/kinematic features
+            and targets_df contains ball_land_x and ball_land_y per frame
     """
-    og_len = len(tracking_df)
-
-    mirrored_tracking_df = tracking_df.clone().with_columns(
-        # only flip y values
-        y=53.3 - pl.col("y"),
-        vy=-1 * pl.col("vy"),
-        oy=-1 * pl.col("oy"),
-        mirrored=pl.lit(True),
+    # Filter out rows where ball_land_x or ball_land_y are null
+    tracking_df = tracking_df.filter(
+        pl.col("ball_land_x").is_not_null() & pl.col("ball_land_y").is_not_null()
     )
-    tracking_df = pl.concat(
-        [
-            tracking_df.with_columns(mirrored=pl.lit(False)),
-            mirrored_tracking_df,
-        ],
-        how="vertical",
+    print(f"After filtering nulls: {len(tracking_df)} rows")
+    print(f"Unique frames: {tracking_df.n_unique(['game_id', 'play_id', 'frame_id'])}")
+
+    # Create target dataframe (one row per frame with ball landing location)
+    targets_df = (
+        tracking_df.select([
+            "game_id",
+            "play_id",
+            "frame_id",
+            "ball_land_x",
+            "ball_land_y",
+        ])
+        .unique()
     )
-    assert len(tracking_df) == og_len * 2, "Lost rows when mirroring tracking data"
-    return tracking_df
-
-
-def get_tackle_loc_target_df(tracking_df: pl.DataFrame) -> pl.DataFrame:
-    """
-    Generate target dataframe for tackle location prediction.
-
-    Args:
-        tracking_df (pl.DataFrame): Tracking data
-
-    Returns:
-        tuple: tuple containing tackle location target dataframe and filtered tracking data.
-    """
-    # generate per-play target dataframe
-    TACKLE_EVENTS = ["tackle", "out_of_bounds", "touchdown", "qb_slide", "fumble"]
-
-    # get the tackle location for each play as the ball carrier's location at the frame of the tackle
-    play_tackle_loc_df = (
-        tracking_df.sort("frameId")
-        .filter(pl.col("event").is_in(TACKLE_EVENTS) & (pl.col("is_ball_carrier") == 1))
-        .group_by(["gameId", "playId", "mirrored"])
-        .tail(1)
-        .select(
-            [
-                "gameId",
-                "playId",
-                "mirrored",
-                "nflId",
-                "displayName",
-                "frameId",
-                "event",
-                "x",
-                "y",
-                "playResult",
-            ]
-        )
-        .rename(
-            {
-                "nflId": "ballCarrierNflId",
-                "displayName": "ballCarrierName",
-                "frameId": "tackle_frameId",
-                "event": "tackle_event",
-                "x": "tackle_x",
-                "y": "tackle_y",
-            }
-        )
-    )
-
-    # we need to convert into relative coordinates which involves comparing against the
-    # anchor point which is per frame
-    tackle_loc_df = (
-        play_tackle_loc_df.join(
-            tracking_df.select(["gameId", "playId", "mirrored", "frameId", "anchor_x", "anchor_y"]).unique(),
-            on=["gameId", "playId", "mirrored"],
-            how="inner",
-        ).with_columns(
-            tackle_x_rel=pl.col("tackle_x") - pl.col("anchor_x"),
-            tackle_y_rel=pl.col("tackle_y") - pl.col("anchor_y"),
-        )
-        # .drop(["anchor_x", "anchor_y"])
-    )
-
-    # only keep plays in dataset that have a valid tackle location target
-    og_play_count = len(tracking_df.select(["gameId", "playId"]).unique())
-    tracking_df = tracking_df.join(
-        tackle_loc_df.select(["gameId", "playId", "mirrored"]).unique(),
-        on=["gameId", "playId", "mirrored"],
-        how="inner",
-    )
-    new_play_count = len(tracking_df.select(["gameId", "playId"]).unique())
-    print(f"Lost {(og_play_count - new_play_count)/og_play_count:.3%} plays when joining with tackle_loc_df")
-    return tackle_loc_df, tracking_df
-
-
-def split_train_test_val(tracking_df: pl.DataFrame, target_df: pl.DataFrame) -> dict[str, pl.DataFrame]:
+    # Select only position and kinematic variables for features
+    # Position: x, y
+    # Kinematic: s (speed), a (acceleration), vx, vy (velocity components), ox, oy (orientation components)
+    features_df = tracking_df.select([
+        "game_id",
+        "play_id",
+        "nfl_id",
+        "frame_id",
+        "x",
+        "y",
+        "s",
+        "a",
+        "vx",
+        "vy",
+        "ox",
+        "oy",
+        "player_side",  # Keep side information to distinguish offense/defense
+        "ball_land_x",  # Keep for joining later
+        "ball_land_y",
+    ])
+
+    return features_df, targets_df
+
+
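# --- Reviewer note (illustrative sketch, not part of the patch) ---------------
# The landing spot is constant within a play, so after .unique() there should be
# exactly one target row per (game_id, play_id, frame_id). A sanity check along
# these lines would catch frames with conflicting landing coordinates:
def assert_one_target_per_frame(targets_df) -> None:
    n_rows = len(targets_df)
    n_frames = targets_df.n_unique(["game_id", "play_id", "frame_id"])
    assert n_rows == n_frames, f"{n_rows - n_frames} frames have duplicate targets"
# --- End reviewer note ---------------------------------------------------------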
""" - tracking_df = tracking_df.sort(["gameId", "playId", "mirrored", "frameId"]) - target_df = target_df.sort(["gameId", "playId", "mirrored"]) + features_df = features_df.sort(["game_id", "play_id", "frame_id"]) + targets_df = targets_df.sort(["game_id", "play_id", "frame_id"]) print( - f"Total set: {tracking_df.n_unique(['gameId', 'playId', 'mirrored'])} plays,", - f"{tracking_df.n_unique(['gameId', 'playId', 'mirrored', "frameId"])} frames", + f"Total set: {features_df.n_unique(['game_id', 'play_id'])} plays,", + f"{features_df.n_unique(['game_id', 'play_id', 'frame_id'])} frames", ) - test_val_ids = tracking_df.select(["gameId", "playId"]).unique(maintain_order=True).sample(fraction=0.3, seed=42) - train_tracking_df = tracking_df.join(test_val_ids, on=["gameId", "playId"], how="anti") - train_tgt_df = target_df.join(test_val_ids, on=["gameId", "playId"], how="anti") + # Split at play level + test_val_ids = features_df.select(["game_id", "play_id"]).unique(maintain_order=True).sample(fraction=0.3, seed=42) + train_features_df = features_df.join(test_val_ids, on=["game_id", "play_id"], how="anti") + train_targets_df = targets_df.join(test_val_ids, on=["game_id", "play_id"], how="anti") print( - f"Train set: {train_tracking_df.n_unique(['gameId', 'playId', 'mirrored'])} plays,", - f"{train_tracking_df.n_unique(['gameId', 'playId', 'mirrored', "frameId"])} frames", + f"Train set: {train_features_df.n_unique(['game_id', 'play_id'])} plays,", + f"{train_features_df.n_unique(['game_id', 'play_id', 'frame_id'])} frames", ) test_ids = test_val_ids.sample(fraction=0.5, seed=42) # 70-15-15 split - test_tracking_df = tracking_df.join(test_ids, on=["gameId", "playId"], how="inner") - test_tgt_df = target_df.join(test_ids, on=["gameId", "playId"], how="inner") + test_features_df = features_df.join(test_ids, on=["game_id", "play_id"], how="inner") + test_targets_df = targets_df.join(test_ids, on=["game_id", "play_id"], how="inner") print( - f"Test set: {test_tracking_df.n_unique(['gameId', 'playId', 'mirrored'])} plays,", - f"{test_tracking_df.n_unique(['gameId', 'playId', 'mirrored', "frameId"])} frames", + f"Test set: {test_features_df.n_unique(['game_id', 'play_id'])} plays,", + f"{test_features_df.n_unique(['game_id', 'play_id', 'frame_id'])} frames", ) - val_ids = test_val_ids.join(test_ids, on=["gameId", "playId"], how="anti") - val_tracking_df = tracking_df.join(val_ids, on=["gameId", "playId"], how="inner") - val_tgt_df = target_df.join(val_ids, on=["gameId", "playId"], how="inner") + val_ids = test_val_ids.join(test_ids, on=["game_id", "play_id"], how="anti") + val_features_df = features_df.join(val_ids, on=["game_id", "play_id"], how="inner") + val_targets_df = targets_df.join(val_ids, on=["game_id", "play_id"], how="inner") print( - f"Validation set: {val_tracking_df.n_unique(['gameId', 'playId', 'mirrored'])} plays,", - f"{val_tracking_df.n_unique(['gameId', 'playId', 'mirrored', "frameId"])} frames", + f"Validation set: {val_features_df.n_unique(['game_id', 'play_id'])} plays,", + f"{val_features_df.n_unique(['game_id', 'play_id', 'frame_id'])} frames", ) return { - "train_features": train_tracking_df, - "train_targets": train_tgt_df, - "test_features": test_tracking_df, - "test_targets": test_tgt_df, - "val_features": val_tracking_df, - "val_targets": val_tgt_df, + "train_features": train_features_df, + "train_targets": train_targets_df, + "test_features": test_features_df, + "test_targets": test_targets_df, + "val_features": val_features_df, + "val_targets": val_targets_df, } -def 
 
 
-def add_relative_positions(tracking_df: pl.DataFrame) -> pl.DataFrame:
-    """
-    Normalize x, y position against an anchor point of the ball carrier's location at the first frame of the play.
-
-    This is not done for the purposes of defining a player ordering, as both The Zoo and Transformer are player-order
-    invariant models. This is done primarily for standardizing the data distribution and making frames look more alike
-    each other.
-
-    Args:
-        tracking_df (pl.DataFrame): Tracking data
-
-    Returns:
-        pl.DataFrame: Tracking data with relative position features.
-    """
-    return (
-        tracking_df.sort("frameId")
-        # Use play-level anchor of ball carrier's location at first frame in the play
-        .with_columns(
-            anchor_x=pl.col("x").filter(pl.col("is_ball_carrier") == 1).first().over(["gameId", "playId", "mirrored"]),
-            anchor_y=pl.col("y").filter(pl.col("is_ball_carrier") == 1).first().over(["gameId", "playId", "mirrored"]),
-        )
-        .with_columns(
-            x_rel=pl.col("x") - pl.col("anchor_x"),
-            y_rel=pl.col("y") - pl.col("anchor_y"),
-        )
-    )
-
-
 def main():
     """
     Main execution function for data preparation.
 
     This function orchestrates the entire data preparation process, including:
-    1. Loading raw data
-    2. Adding features and transforming coordinates
-    3. Generating target variables
-    4. Splitting data into train, validation, and test sets
-    5. Saving processed data to parquet files
+    1. Loading raw input data
+    2. Converting coordinates to Cartesian
+    3. Standardizing play directions
+    4. Preparing features (position and kinematic only) and targets (ball_land_x, ball_land_y)
+    5. Splitting data into train, validation, and test sets
+    6. Saving processed data to parquet files
     """
-    players_df = get_players_df()
-    plays_df = get_plays_df()
-    tracking_df = get_tracking_df()
+    # Load input data
+    tracking_df = load_input_data()
 
-    tracking_df = add_features_to_tracking_df(tracking_df, players_df, plays_df)
+    # Convert to Cartesian coordinates
     tracking_df = convert_tracking_to_cartesian(tracking_df)
-    tracking_df = standardize_tracking_directions(tracking_df)
-    tracking_df = augment_mirror_tracking(tracking_df)
 
-    rel_tracking_df = add_relative_positions(tracking_df)
+    # Standardize directions (all plays moving left to right)
+    tracking_df = standardize_tracking_directions(tracking_df)
 
-    tkl_loc_tgt_df, rel_tracking_df = get_tackle_loc_target_df(rel_tracking_df)
+    # Prepare features and targets
+    features_df, targets_df = prepare_tracking_data(tracking_df)
 
-    split_dfs = split_train_test_val(rel_tracking_df, tkl_loc_tgt_df)
+    # Split into train/val/test
+    split_dfs = split_train_test_val(features_df, targets_df)
 
+    # Save to parquet files
     out_dir = Path("data/split_prepped_data/")
     out_dir.mkdir(exist_ok=True, parents=True)
     for key, df in split_dfs.items():
-        sort_keys = ["gameId", "playId", "mirrored", "frameId"]
+        sort_keys = ["game_id", "play_id", "frame_id"]
+        if "nfl_id" in df.columns:
+            sort_keys.append("nfl_id")
         df.sort(sort_keys).write_parquet(out_dir / f"{key}.parquet")
 
+    print("\nData preparation complete!")
+    print(f"Output saved to: {out_dir}")
+
 
 if __name__ == "__main__":
     main()
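# --- Reviewer note (illustrative sketch, not part of the patch) ---------------
# The split files written above are exactly what datasets.py reads back by name,
# e.g. after running `python src/prep_data.py`:
import polars as pl
from pathlib import Path

out_dir = Path("data/split_prepped_data/")
train_features = pl.read_parquet(out_dir / "train_features.parquet")
train_targets = pl.read_parquet(out_dir / "train_targets.parquet")
assert {"game_id", "play_id", "frame_id"} <= set(train_targets.columns)
# --- End reviewer note ---------------------------------------------------------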
diff --git a/src/train.py b/src/train.py
index 0985384..d1db665 100644
--- a/src/train.py
+++ b/src/train.py
@@ -1,7 +1,7 @@
 """
-Training Script for NFL Big Data Bowl 2024 Tackle Prediction Models
+Training Script for NFL Big Data Bowl 2026 Ball Landing Prediction Models
 
-This module handles the training process for tackle prediction models. It includes
+This module handles the training process for ball landing prediction models. It includes
 functions for loading datasets, predicting using trained models, and conducting
 hyperparameter searches.
 
@@ -29,8 +29,8 @@
 from torch.utils.data import DataLoader
 from tqdm import tqdm
 
-from datasets import BDB2024_Dataset, load_datasets
-from models import LitModel
+from src.datasets import BDB2026_Dataset, load_datasets
+from src.models import LitModel
 
 MODELS_PATH = Path("models")
 MODELS_PATH.mkdir(exist_ok=True)
@@ -59,9 +59,9 @@ def predict_model_as_df(model: LitModel = None, ckpt_path: Path = None, devices=
         model = LitModel.load_from_checkpoint(ckpt_path)
 
     # Load datasets
-    train_ds: BDB2024_Dataset = load_datasets(model.model_type, split="train")
-    val_ds: BDB2024_Dataset = load_datasets(model.model_type, split="val")
-    test_ds: BDB2024_Dataset = load_datasets(model.model_type, split="test")
+    train_ds: BDB2026_Dataset = load_datasets(model.model_type, split="train")
+    val_ds: BDB2026_Dataset = load_datasets(model.model_type, split="val")
+    test_ds: BDB2026_Dataset = load_datasets(model.model_type, split="test")
 
     # Create unshuffled dataloaders for prediction
     dataloaders = {
@@ -78,7 +78,7 @@ def predict_model_as_df(model: LitModel = None, ckpt_path: Path = None, devices=
         preds: np.ndarray = torch.concat(preds, dim=0).cpu().numpy()
 
         # Prepare metadata
-        dataset: BDB2024_Dataset = dataloader.dataset
+        dataset: BDB2026_Dataset = dataloader.dataset
         tgt_df = pl.from_pandas(dataset.tgt_df_partition, include_index=True)
         ds_keys = np.array(dataset.keys)
 
@@ -89,24 +89,20 @@
             tgt_df.join(
                 pl.DataFrame(
                     {
-                        "gameId": ds_keys[:, 0],
-                        "playId": ds_keys[:, 1],
-                        "mirrored": ds_keys[:, 2],
-                        "frameId": ds_keys[:, 3],
+                        "game_id": ds_keys[:, 0],
+                        "play_id": ds_keys[:, 1],
+                        "frame_id": ds_keys[:, 2],
                         "dataset_split": split,
-                        "tackle_x_rel_pred": preds[:, 0].round(2),
-                        "tackle_y_rel_pred": preds[:, 1].round(2),
+                        "ball_land_x_pred": preds[:, 0].round(2),
+                        "ball_land_y_pred": preds[:, 1].round(2),
                     },
-                    schema_overrides={"mirrored": bool},
                 ),
-                on=["gameId", "playId", "mirrored", "frameId"],
+                on=["game_id", "play_id", "frame_id"],
                 how="inner",
             )
             .with_columns(
-                tackle_x_rel_pred=pl.col("tackle_x_rel_pred").round(2),
-                tackle_y_rel_pred=pl.col("tackle_y_rel_pred").round(2),
-                tackle_x_pred=(pl.col("tackle_x_rel_pred") + pl.col("anchor_x")).round(2),
-                tackle_y_pred=(pl.col("tackle_y_rel_pred") + pl.col("anchor_y")).round(2),
+                ball_land_x_pred=pl.col("ball_land_x_pred").round(2),
+                ball_land_y_pred=pl.col("ball_land_y_pred").round(2),
             )
             # add model hparams to pred df
             .with_columns(**{k: pl.lit(v) for k, v in model.hparams.items()})
@@ -212,8 +208,8 @@
         return lit_model
 
     # Load datasets
-    train_ds: BDB2024_Dataset = load_datasets(model_type, split="train")
-    val_ds: BDB2024_Dataset = load_datasets(model_type, split="val")
+    train_ds: BDB2026_Dataset = load_datasets(model_type, split="train")
+    val_ds: BDB2026_Dataset = load_datasets(model_type, split="val")
 
     # Create dataloaders
     train_dataloader = DataLoader(train_ds, batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=30)
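# --- Reviewer note (illustrative sketch, not part of the patch) ---------------
# Since predict_model_as_df() joins predictions back onto the target frame, a
# per-frame Euclidean landing error in yards falls out directly. The checkpoint
# path below is hypothetical:
import polars as pl
from pathlib import Path
from src.train import predict_model_as_df

pred_df = predict_model_as_df(ckpt_path=Path("models/transformer.ckpt"))
pred_df = pred_df.with_columns(
    landing_err=(
        (pl.col("ball_land_x") - pl.col("ball_land_x_pred")) ** 2
        + (pl.col("ball_land_y") - pl.col("ball_land_y_pred")) ** 2
    ).sqrt()
)
print(pred_df.group_by("dataset_split").agg(pl.col("landing_err").mean()))
# --- End reviewer note ---------------------------------------------------------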