## 🌟 Overview
Accurate climate projections are critical in an era of accelerating climate change. Traditional climate models face challenges in representing small-scale atmospheric processes—such as clouds, storms, turbulence, and precipitation—because these processes occur at scales smaller than the model grid and are computationally expensive to resolve explicitly. This project develops a machine learning emulator that replicates the subgrid-scale physics within the E3SM-MMF climate model. By replacing high-resolution physical parameterizations with an efficient neural network, the solution offers a scalable, fast, and physically credible approach to long-term climate prediction.
This project was created as part of the Kaggle LEAP - Atmospheric Physics using AI (ClimSim) competition. The final submission achieved a bronze medal with a public leaderboard (lb) score of 0.73575 and a private leaderboard (pb) score of 0.73955.
The dataset for this competition is generated by the state-of-the-art E3SM-MMF climate model. Its multi-scale framework explicitly resolves small-scale processes (e.g., clouds, storms, turbulence) that influence large-scale climate patterns. However, the computational cost of explicit resolution is extremely high. The task is to emulate the effects of these processes with a machine learning model that is far less computationally expensive.
Each row in the training set corresponds to the inputs and outputs of a cloud-resolving model (CRM) in E3SM-MMF at a given location and timestep. The dataset includes:
- **Inputs:** 556 columns representing 25 input variables. Some variables are scalars while others are vertically resolved over 60 levels. For vertically resolved variables, an underscore and level number (ranging from 0 to 59) are appended to the variable name, with lower numbers representing higher positions in the atmosphere.
- **Targets:** 368 columns representing 14 target variables. These include both vertically resolved variables (e.g., heating tendency across 60 levels) and scalars (e.g., surface fluxes).
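Given this naming convention, the flat column layout of a vertically resolved variable can be expanded programmatically; a purely illustrative one-liner:

```python
# state_t_0 is the topmost atmospheric level, state_t_59 the lowest.
state_t_cols = [f"state_t_{lvl}" for lvl in range(60)]
```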
**Input variables**

| Name | Description | Dimension | Units |
|---|---|---|---|
| `state_t` | Air temperature | 60 levels | K |
| `state_q0001` | Specific humidity | 60 levels | kg/kg |
| `state_q0002` | Cloud liquid mixing ratio | 60 levels | kg/kg |
| `state_q0003` | Cloud ice mixing ratio | 60 levels | kg/kg |
| `state_u` | Zonal wind speed | 60 levels | m/s |
| `state_v` | Meridional wind speed | 60 levels | m/s |
| `state_ps` | Surface pressure | Scalar | Pa |
| `pbuf_SOLIN` | Solar insolation | Scalar | W/m² |
| `pbuf_LHFLX` | Surface latent heat flux | Scalar | W/m² |
| `pbuf_SHFLX` | Surface sensible heat flux | Scalar | W/m² |
| `pbuf_TAUX` | Zonal surface stress | Scalar | N/m² |
| `pbuf_TAUY` | Meridional surface stress | Scalar | N/m² |
| `pbuf_COSZRS` | Cosine of solar zenith angle | Scalar | — |
| `cam_in_ALDIF` | Albedo for diffuse longwave radiation | Scalar | — |
| `cam_in_ALDIR` | Albedo for direct longwave radiation | Scalar | — |
| `cam_in_ASDIF` | Albedo for diffuse shortwave radiation | Scalar | — |
| `cam_in_ASDIR` | Albedo for direct shortwave radiation | Scalar | — |
| `cam_in_LWUP` | Upward longwave flux | Scalar | W/m² |
| `cam_in_ICEFRAC` | Sea-ice areal fraction | Scalar | — |
| `cam_in_LANDFRAC` | Land areal fraction | Scalar | — |
| `cam_in_OCNFRAC` | Ocean areal fraction | Scalar | — |
| `cam_in_SNOWHLAND` | Snow depth over land | Scalar | m |
| `pbuf_ozone` | Ozone volume mixing ratio | 60 levels | mol/mol |
| `pbuf_CH4` | Methane volume mixing ratio | 60 levels | mol/mol |
| `pbuf_N2O` | Nitrous oxide volume mixing ratio | 60 levels | mol/mol |
**Target variables**

| Name | Description | Dimension | Units |
|---|---|---|---|
| `ptend_t` | Heating tendency | 60 levels | K/s |
| `ptend_q0001` | Moistening tendency | 60 levels | kg/kg/s |
| `ptend_q0002` | Cloud liquid mixing ratio tendency | 60 levels | kg/kg/s |
| `ptend_q0003` | Cloud ice mixing ratio tendency | 60 levels | kg/kg/s |
| `ptend_u` | Zonal wind acceleration | 60 levels | m/s² |
| `ptend_v` | Meridional wind acceleration | 60 levels | m/s² |
| `cam_out_NETSW` | Net shortwave flux at surface | Scalar | W/m² |
| `cam_out_FLWDS` | Downward longwave flux at surface | Scalar | W/m² |
| `cam_out_PRECSC` | Snow rate (liquid water equivalent) | Scalar | m/s |
| `cam_out_PRECC` | Rain rate | Scalar | m/s |
| `cam_out_SOLS` | Downward visible direct solar flux to surface | Scalar | W/m² |
| `cam_out_SOLL` | Downward near-infrared direct solar flux to surface | Scalar | W/m² |
| `cam_out_SOLSD` | Downward diffuse visible solar flux to surface | Scalar | W/m² |
| `cam_out_SOLLD` | Downward diffuse near-infrared solar flux to surface | Scalar | W/m² |
- **Efficient Data Handling:** CSV files are converted into more efficient formats (such as TFRecord, Parquet, or NumPy arrays) to improve input/output performance.
- **Normalization:** Both inputs and targets are normalized using TensorFlow's `Normalization` layer to ensure stable and effective training (a minimal pipeline sketch follows this list).
- **Deterministic Data Pipeline:** A reproducible data pipeline is built using fixed random seeds and deterministic shuffling. A batch size of 512 is used for training.
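A minimal sketch of this preprocessing setup, assuming the inputs are already loaded as a float tensor; the sample data, seed, and buffer sizes here are illustrative placeholders, not the competition values:

```python
import tensorflow as tf

tf.keras.utils.set_random_seed(42)  # fixed seed for reproducibility

# Placeholder for real training rows; shape (n_rows, 556 input columns).
x_train = tf.random.normal((4096, 556))

# Adapt a Normalization layer to the training statistics, then build a
# deterministic, batched input pipeline.
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(x_train)

ds = (
    tf.data.Dataset.from_tensor_slices(x_train)
    .shuffle(4096, seed=42, reshuffle_each_iteration=False)  # deterministic shuffle
    .batch(512)                                              # batch size from the text
    .map(norm, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)
```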
The model is designed to capture multi-scale atmospheric dynamics by incorporating several advanced components:
- **Input Reformatting:** A custom function (`x_to_seq`) converts the raw 556-dimensional input vector into a sequence format that separates vertically resolved data from scalar variables (see the sketch after this list).
- **U-Net Style Architecture with Transformer and Residual Blocks:**
  - **Encoder & Decoder Blocks:** The encoder progressively downsamples the input through repeated residual blocks to extract high-level features, while the decoder upsamples these features to reconstruct the target variables.
  - **Transformer Bottleneck:** A transformer block with multi-head attention layers captures long-range dependencies and complex interactions between vertical levels.
  - **Residual Connections:** Skip connections preserve fine-scale details throughout the network.
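A sketch of what `x_to_seq` could look like; the column ordering assumed here (6 vertically resolved state variables, then 16 scalars, then 3 vertically resolved pbuf variables) is inferred from the dataset tables above and is not guaranteed to match the exact competition layout:

```python
import tensorflow as tf

def x_to_seq(x):
    # x: (batch, 556) flat input vector.
    state = tf.reshape(x[:, :360], (-1, 6, 60))     # 6 state profiles x 60 levels
    scalars = x[:, 360:376]                         # 16 scalar inputs
    pbuf = tf.reshape(x[:, 376:556], (-1, 3, 60))   # 3 pbuf profiles x 60 levels

    # Stack the 9 profiles as per-level features: (batch, 60, 9).
    profiles = tf.transpose(tf.concat([state, pbuf], axis=1), (0, 2, 1))
    # Broadcast the scalars to every vertical level: (batch, 60, 16).
    tiled = tf.tile(scalars[:, None, :], (1, 60, 1))
    # Final sequence of 60 levels with 25 features each: (batch, 60, 25).
    return tf.concat([profiles, tiled], axis=-1)
```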
Below is a snippet illustrating the custom transformer encoder layer:
```python
import tensorflow as tf
import keras
from tensorflow.keras.layers import (
    Dense,
    Dropout,
    LayerNormalization,
    MultiHeadAttention,
)


@keras.saving.register_keras_serializable()
class TransformerEncoderLayer(tf.keras.layers.Layer):
    def __init__(self, head_size, num_heads, ff_dim, dropout=0.1, **kwargs):
        super().__init__(**kwargs)
        # Self-attention across the 60 vertical levels of the sequence.
        self.att = MultiHeadAttention(key_dim=head_size, num_heads=num_heads,
                                      dropout=dropout)
        # Position-wise feed-forward network; the final width of 25 matches
        # the per-level feature count of the input sequence.
        self.ffn = tf.keras.Sequential([
            Dense(ff_dim, activation='gelu'),
            Dense(25),
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(dropout)
        self.dropout2 = Dropout(dropout)

    def call(self, inputs, training=False):
        # Post-norm residual block: attention, then feed-forward.
        attn_output = self.att(inputs, inputs, training=training)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1, training=training)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
```
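Note that the feed-forward block ends in `Dense(25)`, keeping each level's feature width at 25 so that the residual additions in `call` stay shape-compatible with the `(60, 25)` sequences produced by `x_to_seq` (assuming the sequence layout sketched above).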
- **Ensemble Modeling:** Multiple model variants (including different U-Net and transformer configurations) are trained, and their predictions are averaged. This ensemble strategy increases robustness and overall performance.
- **Training Strategy:**
  - **Loss Function & Metrics:** The model is trained with Mean Squared Error (MSE) loss. Performance is evaluated with a custom weighted R² metric (sketched after this list):

    $$ R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}} $$

    where residuals are weighted element-wise by the values from `sample_submission.csv`.
  - **Callbacks:**
    - **Early Stopping:** monitors validation loss to prevent overfitting.
    - **Model Checkpointing:** saves the best model during training based on validation performance.
    - **Learning Rate Scheduling:** a cosine annealing schedule with warm-up phases stabilizes and speeds up training.
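A minimal sketch of the weighted R² metric, assuming `weights` is a per-target weight vector taken from `sample_submission.csv` that broadcasts over the batch axis:

```python
import tensorflow as tf

def weighted_r2(y_true, y_pred, weights):
    # Element-wise weighting of the squared residuals and of the total
    # variance around each target's batch mean.
    ss_res = tf.reduce_sum(weights * tf.square(y_true - y_pred))
    ss_tot = tf.reduce_sum(weights * tf.square(y_true - tf.reduce_mean(y_true, axis=0)))
    return 1.0 - ss_res / ss_tot
```

And a sketch of a cosine annealing schedule with a linear warm-up; the step counts and base learning rate are illustrative placeholders, not the values used in the competition runs:

```python
import math
import tensorflow as tf

class WarmupCosine(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, base_lr=1e-3, warmup_steps=1_000, total_steps=100_000):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = self.base_lr * step / self.warmup_steps  # linear ramp-up
        progress = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
        cosine = 0.5 * self.base_lr * (1.0 + tf.cos(math.pi * progress))  # anneal to 0
        return tf.where(step < self.warmup_steps, warmup, cosine)
```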
- **Post-Processing:** After predictions are generated, they are scaled back to the original target scale using the stored mean and standard deviation values.
- **Weighting Predictions:** Final predictions are multiplied element-wise by the sample submission weights.
- **Ensemble Averaging:** Predictions from multiple models are averaged before generating the final submission file (see the combined sketch below).
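A combined sketch of these three steps, assuming `models` is the list of trained ensemble members and that `target_mean`, `target_std`, and `weights` were stored during preprocessing (all names here are illustrative):

```python
import numpy as np

# Average the ensemble's normalized predictions, undo target normalization,
# then apply the element-wise sample-submission weights.
preds = np.mean([m.predict(x_test, batch_size=512) for m in models], axis=0)
preds = preds * target_std + target_mean  # back to the original target scale
preds = preds * weights                   # element-wise submission weights
```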
- **Evaluation & Results:** Evaluation is performed using the custom weighted R² metric, with predictions weighted by the values in `sample_submission.csv`. Final leaderboard results:
  - Public score (lb): 0.73575
  - Private score (pb): 0.73955