🌟 Overview
Accurate climate projections are critical in an era of accelerating climate change. Traditional climate models face challenges in representing small-scale atmospheric processes—such as clouds, storms, turbulence, and precipitation—because these processes occur at scales smaller than the model grid and are computationally expensive to resolve explicitly. This project develops a machine learning emulator that replicates the subgrid-scale physics within the E3SM-MMF climate model. By replacing high-resolution physical parameterizations with an efficient neural network, the solution offers a scalable, fast, and physically credible approach to long-term climate prediction.
This project was created as part of the Kaggle LEAP - Atmospheric Physics using AI (ClimSim) competition. The final submission achieved a bronze medal with a public leaderboard (lb) score of 0.73575 and a private leaderboard (pb) score of 0.73955.
The dataset for this competition is generated by the state-of-the-art E3SM-MMF climate model. Its multi-scale framework explicitly resolves small-scale processes (e.g., clouds, storms, turbulence) that influence large-scale climate patterns. However, the computational cost of explicit resolution is extremely high. The task is to emulate the effects of these processes with a machine learning model that is far less computationally expensive.
Each row in the training set corresponds to the inputs and outputs of a cloud-resolving model (CRM) in E3SM-MMF at a given location and timestep. The dataset includes:
- 556 columns representing 25 input variables. Some variables are scalars while others are vertically resolved over 60 levels. For vertically resolved variables, an underscore and level number (ranging from 0 to 59) are appended to the variable name, with lower numbers representing higher positions in the atmosphere.
- 368 columns representing 14 target variables. These include both vertically resolved variables (e.g., heating tendency across 60 levels) and scalars (e.g., surface fluxes).

Input variables:

| Name | Description | Dimension | Units |
|---|---|---|---|
| state_t | Air temperature | 60 levels | K |
| state_q0001 | Specific humidity | 60 levels | kg/kg |
| state_q0002 | Cloud liquid mixing ratio | 60 levels | kg/kg |
| state_q0003 | Cloud ice mixing ratio | 60 levels | kg/kg |
| state_u | Zonal wind speed | 60 levels | m/s |
| state_v | Meridional wind speed | 60 levels | m/s |
| state_ps | Surface pressure | Scalar | Pa |
| pbuf_SOLIN | Solar insolation | Scalar | W/m² |
| pbuf_LHFLX | Surface latent heat flux | Scalar | W/m² |
| pbuf_SHFLX | Surface sensible heat flux | Scalar | W/m² |
| pbuf_TAUX | Zonal surface stress | Scalar | N/m² |
| pbuf_TAUY | Meridional surface stress | Scalar | N/m² |
| pbuf_COSZRS | Cosine of solar zenith angle | Scalar | — |
| cam_in_ALDIF | Albedo for diffuse longwave radiation | Scalar | — |
| cam_in_ALDIR | Albedo for direct longwave radiation | Scalar | — |
| cam_in_ASDIF | Albedo for diffuse shortwave radiation | Scalar | — |
| cam_in_ASDIR | Albedo for direct shortwave radiation | Scalar | — |
| cam_in_LWUP | Upward longwave flux | Scalar | W/m² |
| cam_in_ICEFRAC | Sea-ice areal fraction | Scalar | — |
| cam_in_LANDFRAC | Land areal fraction | Scalar | — |
| cam_in_OCNFRAC | Ocean areal fraction | Scalar | — |
| cam_in_SNOWHLAND | Snow depth over land | Scalar | m |
| pbuf_ozone | Ozone volume mixing ratio | 60 levels | mol/mol |
| pbuf_CH4 | Methane volume mixing ratio | 60 levels | mol/mol |
| pbuf_N2O | Nitrous oxide volume mixing ratio | 60 levels | mol/mol |

Target variables:

| Name | Description | Dimension | Units |
|---|---|---|---|
| ptend_t | Heating tendency | 60 levels | K/s |
| ptend_q0001 | Moistening tendency | 60 levels | kg/kg/s |
| ptend_q0002 | Cloud liquid mixing ratio change over time | 60 levels | kg/kg/s |
| ptend_q0003 | Cloud ice mixing ratio change over time | 60 levels | kg/kg/s |
| ptend_u | Zonal wind acceleration | 60 levels | m/s² |
| ptend_v | Meridional wind acceleration | 60 levels | m/s² |
| cam_out_NETSW | Net shortwave flux at surface | Scalar | W/m² |
| cam_out_FLWDS | Downward longwave flux at surface | Scalar | W/m² |
| cam_out_PRECSC | Snow rate (liquid water equivalent) | Scalar | m/s |
| cam_out_PRECC | Rain rate | Scalar | m/s |
| cam_out_SOLS | Downward visible direct solar flux to surface | Scalar | W/m² |
| cam_out_SOLL | Downward near-infrared direct solar flux to surface | Scalar | W/m² |
| cam_out_SOLSD | Downward diffuse solar flux to surface | Scalar | W/m² |
| cam_out_SOLLD | Downward diffuse near-infrared solar flux to surface | Scalar | W/m² |

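
The naming rule described above (an underscore and a level index from 0 to 59 appended to each vertically resolved variable) can be checked with a short sketch. The helper `column_names` is hypothetical, and the variable order simply follows the tables above; the order in the actual CSV files is assumed, not confirmed, to match.

```python
# Variables in the order listed in the tables; True marks a 60-level profile.
INPUT_VARS = [
    ("state_t", True), ("state_q0001", True), ("state_q0002", True),
    ("state_q0003", True), ("state_u", True), ("state_v", True),
    ("state_ps", False), ("pbuf_SOLIN", False), ("pbuf_LHFLX", False),
    ("pbuf_SHFLX", False), ("pbuf_TAUX", False), ("pbuf_TAUY", False),
    ("pbuf_COSZRS", False), ("cam_in_ALDIF", False), ("cam_in_ALDIR", False),
    ("cam_in_ASDIF", False), ("cam_in_ASDIR", False), ("cam_in_LWUP", False),
    ("cam_in_ICEFRAC", False), ("cam_in_LANDFRAC", False),
    ("cam_in_OCNFRAC", False), ("cam_in_SNOWHLAND", False),
    ("pbuf_ozone", True), ("pbuf_CH4", True), ("pbuf_N2O", True),
]
TARGET_VARS = [
    ("ptend_t", True), ("ptend_q0001", True), ("ptend_q0002", True),
    ("ptend_q0003", True), ("ptend_u", True), ("ptend_v", True),
    ("cam_out_NETSW", False), ("cam_out_FLWDS", False),
    ("cam_out_PRECSC", False), ("cam_out_PRECC", False),
    ("cam_out_SOLS", False), ("cam_out_SOLL", False),
    ("cam_out_SOLSD", False), ("cam_out_SOLLD", False),
]


def column_names(variables, n_levels=60):
    """Expand each profile variable into name_0 .. name_{n_levels - 1}."""
    cols = []
    for name, is_profile in variables:
        if is_profile:
            cols.extend(f"{name}_{level}" for level in range(n_levels))
        else:
            cols.append(name)
    return cols


input_cols = column_names(INPUT_VARS)
target_cols = column_names(TARGET_VARS)
print(len(INPUT_VARS), len(input_cols))    # 25 variables, 556 columns
print(len(TARGET_VARS), len(target_cols))  # 14 variables, 368 columns
```

The counts follow from the tables: 9 input profiles × 60 levels + 16 scalars = 556, and 6 target profiles × 60 levels + 8 scalars = 368.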
- Efficient Data Handling: CSV files are converted into more efficient formats (such as TFRecord, Parquet, or NumPy arrays) to improve input/output performance.
- Both inputs and targets are normalized using TensorFlow's `Normalization` layer to ensure stable and effective training.
- Deterministic Data Pipeline: A reproducible data pipeline is built using fixed random seeds and deterministic shuffling. A batch size of 512 is used for training.
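
The normalization and deterministic batching steps can be illustrated with a minimal NumPy sketch. This is not the project's pipeline: plain NumPy statistics stand in for the Keras `Normalization` layer, and the helper names `fit_standardizer` and `iter_batches`, the seed value, and the per-epoch reseeding scheme are all assumptions made for illustration. Only the batch size of 512 comes from the text.

```python
import numpy as np


def fit_standardizer(x, eps=1e-8):
    """Return per-column mean and std; constant columns get std 1 (assumption)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std = np.where(std < eps, 1.0, std)
    return mean, std


def iter_batches(x, y, batch_size=512, seed=42, epoch=0):
    """Yield shuffled (x, y) batches; the same seed and epoch give the same order."""
    rng = np.random.default_rng(seed + epoch)
    order = rng.permutation(len(x))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield x[idx], y[idx]


# Toy data with made-up shapes, only to exercise the helpers.
data_rng = np.random.default_rng(0)
x_raw = data_rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
y_raw = data_rng.normal(size=(1000, 2))

x_mean, x_std = fit_standardizer(x_raw)
x_norm = (x_raw - x_mean) / x_std

run_a = list(iter_batches(x_norm, y_raw))
run_b = list(iter_batches(x_norm, y_raw))
```

Storing `x_mean` and `x_std` (and the equivalent target statistics) matters because the same values are needed later to map predictions back to physical units.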
The model is designed to capture multi-scale atmospheric dynamics by incorporating several advanced components:
- Input Reformatting: A custom function (`x_to_seq`) converts the raw 556-dimensional input vector into a sequence format that separates vertically resolved data from scalar variables.
- U-Net Style Architecture with Transformer and Residual Blocks:
  - Encoder & Decoder Blocks: The encoder progressively downsamples the input using repeated residual blocks to extract high-level features, while the decoder upsamples these features to reconstruct the target variables.
  - Transformer Bottleneck: A transformer block with multi-head attention layers captures long-range dependencies and complex interactions between vertical levels.
  - Residual Connections: Skip connections preserve fine-scale details throughout the network.
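
The project's `x_to_seq` is not shown, so the following is only a guess at what such a reformatting step might look like, written in NumPy. It assumes the 556 columns follow the order of the input table (six 60-level state profiles, then 16 scalars, then three 60-level trace-gas profiles) and that scalars are repeated at every level; both are assumptions, and the name `x_to_seq_sketch` is hypothetical.

```python
import numpy as np

N_LEVELS = 60


def x_to_seq_sketch(x):
    """Reshape (n, 556) inputs into (n, 60, 25) per-level feature sequences."""
    x = np.asarray(x)
    n = x.shape[0]
    state = x[:, :360].reshape(n, 6, N_LEVELS)       # state_t ... state_v
    scalars = x[:, 360:376]                          # 16 scalar inputs
    gases = x[:, 376:556].reshape(n, 3, N_LEVELS)    # ozone, CH4, N2O
    profiles = np.concatenate([state, gases], axis=1)        # (n, 9, 60)
    profiles = np.transpose(profiles, (0, 2, 1))             # (n, 60, 9)
    tiled = np.repeat(scalars[:, None, :], N_LEVELS, axis=1)  # (n, 60, 16)
    return np.concatenate([profiles, tiled], axis=-1)         # (n, 60, 25)


# Toy input whose values equal their flat index, so positions are easy to trace.
x_demo = np.arange(2 * 556, dtype=np.float64).reshape(2, 556)
seq_demo = x_to_seq_sketch(x_demo)
print(seq_demo.shape)  # (2, 60, 25)
```

With this layout, each of the 60 sequence positions carries 9 profile values for that level plus the 16 scalars, which is one way to give convolutional and attention layers a vertical axis to operate along.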
Below is a snippet illustrating the custom transformer encoder layer:
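```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout, LayerNormalization, MultiHeadAttention


class TransformerEncoderLayer(tf.keras.layers.Layer):
    def __init__(self, head_size, num_heads, ff_dim, dropout=0.1, **kwargs):
        super().__init__(**kwargs)
        self.ff_dim = ff_dim
        self.att = MultiHeadAttention(key_dim=head_size, num_heads=num_heads, dropout=dropout)
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(dropout)
        self.dropout2 = Dropout(dropout)

    def build(self, input_shape):
        # The source snippet is cut off after the first Dense layer. The output
        # projection below is an assumption: it maps back to the input width so
        # that the residual addition in call() is shape-compatible.
        self.ffn = tf.keras.Sequential([
            Dense(self.ff_dim, activation='gelu'),
            Dense(input_shape[-1]),
        ])
        super().build(input_shape)

    def call(self, inputs, training=False):
        # Self-attention sub-block with residual connection and layer norm.
        attn_output = self.att(inputs, inputs, training=training)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        # Position-wise feed-forward sub-block, again with residual and norm.
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
```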
Ensemble Modeling
Multiple model variants (including different U-Net and transformer configurations) are trained, and predictions are averaged. This ensemble strategy increases robustness and overall performance.
Training Strategy
Loss Function & Metrics:
The model is trained using Mean Squared Error (MSE) loss. Performance is evaluated with a custom weighted R² metric:
$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$
where the residuals are weighted element-wise by the sample submission weights.
- Early Stopping: Monitors validation loss to prevent overfitting.
- Model Checkpointing: Saves the best model during training based on validation performance.
- Learning Rate Scheduling: A cosine annealing schedule with warm-up phases is used to stabilize and speed up training.
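
A cosine annealing schedule with warm-up can be sketched in a few lines of plain Python. The function name `warmup_cosine_lr` and every numeric value below (step counts, peak and minimum learning rate) are illustrative assumptions; the source does not state the actual hyperparameters, and this sketch uses a single linear warm-up phase rather than any restart scheme.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates use a small learning rate.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, chosen only to show the shape of the curve.
schedule = [
    warmup_cosine_lr(s, total_steps=100, warmup_steps=10, peak_lr=1e-3)
    for s in range(101)
]
```

In a Keras training loop, a function like this could be wrapped in a `tf.keras.callbacks.LearningRateScheduler` if the schedule is defined per epoch rather than per step.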
After predictions are generated, they are scaled back to the original target values using stored mean and standard deviation values.
Weighting Predictions
Final predictions are multiplied element-wise by the sample submission weights.
Ensemble Averaging
Predictions from multiple models are averaged before generating the final submission file.
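
The three post-processing steps above (inverse scaling, weighting, ensemble averaging) can be combined in a short NumPy sketch. The function name `postprocess` and the toy numbers are assumptions for illustration; the real statistics and weights come from the training data and the sample submission.

```python
import numpy as np


def postprocess(normalized_preds, y_mean, y_std, weights):
    """Denormalize each model's output, apply column weights, then average."""
    submissions = []
    for pred in normalized_preds:
        restored = pred * y_std + y_mean        # back to original target units
        submissions.append(restored * weights)  # element-wise column weights
    return np.mean(submissions, axis=0)         # simple ensemble average


# Two hypothetical models, one row, two target columns.
preds = [np.array([[0.0, 1.0]]), np.array([[2.0, 3.0]])]
y_mean = np.array([10.0, 20.0])
y_std = np.array([2.0, 4.0])
weights = np.array([1.0, 0.0])

final = postprocess(preds, y_mean, y_std, weights)
print(final)  # [[12.  0.]]
```

Because all three steps are linear, averaging the normalized predictions first and then rescaling and weighting once would give the same result.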
Evaluation & Results
Evaluation is performed using a custom weighted R² metric, where predictions are weighted by the sample submission weights. Final leaderboard results were:
- Public Score (lb): 0.73575
- Private Score (pb): 0.73955
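
For reference, a weighted R² of the kind described above might be computed as follows. This is a sketch under assumptions, not the competition's official scoring code: it scales both targets and predictions by the per-column weights, computes R² per column, and averages across columns. The handling of constant columns (for example those with zero weight) is also an assumption.

```python
import numpy as np


def weighted_r2(y_true, y_pred, weights):
    """Mean per-column R^2 after scaling both arrays by per-column weights."""
    yt = np.asarray(y_true, dtype=np.float64) * weights
    yp = np.asarray(y_pred, dtype=np.float64) * weights
    ss_res = np.sum((yt - yp) ** 2, axis=0)
    ss_tot = np.sum((yt - yt.mean(axis=0)) ** 2, axis=0)
    # Assumption: a constant column scores 1.0 when predicted exactly and
    # 0.0 otherwise, which avoids dividing by zero.
    safe_tot = np.where(ss_tot > 0, ss_tot, 1.0)
    constant_score = np.where(ss_res == 0, 1.0, 0.0)
    r2 = np.where(ss_tot > 0, 1.0 - ss_res / safe_tot, constant_score)
    return float(r2.mean())


# Toy check: a perfect prediction scores 1, predicting the column mean scores 0.
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
w = np.array([1.0, 0.5])
perfect = weighted_r2(y_true, y_true, w)
baseline = weighted_r2(y_true, np.broadcast_to(y_true.mean(axis=0), y_true.shape), w)
```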