Training
tnipen committed Jun 23, 2024
1 parent c710786 commit 25c3803
Showing 5 changed files with 62 additions and 20 deletions.
7 changes: 4 additions & 3 deletions index.html
@@ -9,12 +9,13 @@
<p>
This is a quick-start tutorial on regional data-driven modelling with Anemoi. The content is
developed by Håvard Homleid Haugen, Magnus Sikora Ingstad, Thomas Nipen, Even Nordhagen, Aram
-Farhad Salihi, Ivar Seierstad, and Paulina Tedesco.
+Farhad Salihi, Ivar Seierstad, and Paulina Tedesco. Contact thomasn@met.no if you find errors or
+have suggestions.
</p>

-<p>
+<!--p>
<b>CURRENTLY UNDER DEVELOPMENT</b>
-</p>
+</p-->

<!--h1>Tutorial</h1-->

File renamed without changes.
23 changes: 23 additions & 0 deletions tutorial/_devel/overview.markdown
@@ -0,0 +1,23 @@
---
layout: post
title: "Overview"
date: 2024-06-14 09:00:00 +0200
author: Thomas Nipen (thomasn@met.no)
order: 0
toc: true
tags: Anemoi
---

The Anemoi framework consists of several Python packages, including *aifs-mono*, *anemoi-datasets*, *anemoi-models*, and *anemoi-utils*.


Here is a summary of the current combination of branches and repos needed for regional modelling. This will be
updated as changes are integrated into the main part of Anemoi.

| Package | Location | Branch |
| ------- | -------- | ------ |
| aifs-mono | github.com/ecmwf-lab | hackathon |
| anemoi-datasets | github.com/metno | feature-branch |
| anemoi-models | github.com/metno | feature/graph_refactor |
| anemoi-utils | github.com/ecmwf | main |
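
As an illustration, one plausible way to check out this combination is sketched below. The full
repository URLs are assumptions pieced together from the locations in the table and may differ.

{% highlight bash %}
# Hypothetical checkout of the combination above; the full URLs are
# assumed from the table locations and may differ.
git clone --branch hackathon https://github.com/ecmwf-lab/aifs-mono.git
git clone --branch feature-branch https://github.com/metno/anemoi-datasets.git
git clone --branch feature/graph_refactor https://github.com/metno/anemoi-models.git
git clone --branch main https://github.com/ecmwf/anemoi-utils.git
{% endhighlight %}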

6 changes: 3 additions & 3 deletions tutorial/datasets.markdown
@@ -10,9 +10,9 @@ tags: anemoi

## Downloading existing datasets

-To see what datasets are already available, check out https://anemoi.ecmwf.int/datasets (requires ECMWF login
-credentials). The site provides download links to files in S3 buckets, and paths to where files are located
-on LUMI and Leonardo.
+To see what datasets are already available, check out [https://anemoi.ecmwf.int/datasets](https://anemoi.ecmwf.int/datasets)
+(requires ECMWF login credentials). The site provides download links to files in S3 buckets, and paths to where
+files are located on LUMI and Leonardo.
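
For example, fetching a dataset could look like the sketch below; the URL is a hypothetical
placeholder for a download link copied from that page.

{% highlight bash %}
# Hypothetical placeholder URL; copy the actual S3 download link for a
# dataset from the catalogue page above.
wget https://some-s3-bucket.example/some-dataset.zarr
{% endhighlight %}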

## Creating your own dataset

46 changes: 32 additions & 14 deletions tutorial/training.markdown
@@ -22,24 +22,27 @@ model, which we will call: `config_regional.yaml`
{% include files/training/config_regional.yaml %}
{% endhighlight %}

-The part under "default" loads options from . The rest overrides these.
-
The first part specifies the default options from the given config files. For example, `data: zarr` loads
-data options from `config/data/zarr.yaml`.
+data options from `aifs-mono/aifs/config/data/zarr.yaml`.

-The options after the `default` section override specific options provided by default. For example, we
-override the number of channels in our model to 512 (which is 1024 in the default configuration).
+The options after the `defaults` section override specific options provided by default. For example, we
+override the number of GPUs per node to 8 (which is 1 in the default atos configuration that we loaded for
+hardware).
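
To make this concrete, the top of such a config might look like the sketch below. Only the two
entries named in the text are shown; any other entries in the `defaults` list are omitted.

{% highlight yaml %}
# Sketch: each defaults entry loads a config file from aifs-mono/aifs/config/,
# e.g. data: zarr loads config/data/zarr.yaml. Keys placed after the
# defaults list override individual values from those files.
defaults:
  - data: zarr
  - hardware: atos

hardware:
  num_gpus_per_node: 8  # overrides the value 1 from the atos hardware config
{% endhighlight %}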

The options that are important for us are:
-- hardware.paths.data: Base directory where datasets are stored
-- hardware.paths.output: where will model checkpoints and plots be stored
-- hardware.graphs
-
-### Diagnostics
-
-You can enable your training run to log to ML-flow by setting `enabled: True` under `mlflow`. The value you
-set for `experiment_name` to create a group many of your runs. `run_name` should be something uniquely
-describing one specific training run.
+- **hardware.num_gpus_per_node**: Set this to 8 on LUMI (as there are 8 GPU partitions per node). Other compute
+  clusters might have a different value.
+- **hardware.num_gpus_per_model**: This specifies model parallelism. When running large models on many nodes,
+  consider increasing this.
+- **hardware.paths.data**: Base directory where datasets are stored.
+- **hardware.paths.output**: Where model checkpoints and other output data, such as plots, will be stored.
+- **hardware.files**: This names the datasets that we will train with. Use `dataset` for the
+  global dataset and `dataset_lam` for the limited-area dataset.
+- **hardware.files.graph**: If you have pre-computed a specific graph, specify it here. Otherwise, a new
+  graph will be constructed on the fly.
+- **diagnostic.log.mlflow**: You can enable your training run to log to MLflow by setting `enabled: True`.
+  The value you set for `experiment_name` groups many of your runs; `run_name` should
+  uniquely describe one specific training run.
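
A sketch of how these options might be laid out in `config_regional.yaml` follows. The nesting
mirrors the option names in the list above; all paths and file names are placeholders.

{% highlight yaml %}
# Sketch only: paths and file names below are placeholders.
hardware:
  num_gpus_per_node: 8   # 8 GPU partitions per node on LUMI
  num_gpus_per_model: 1  # increase for model parallelism across many nodes
  paths:
    data: /path/to/datasets   # placeholder
    output: /path/to/output   # placeholder: checkpoints, plots, ...
  files:
    dataset: global-era5.zarr    # placeholder: global dataset
    dataset_lam: regional.zarr   # placeholder: limited-area dataset
    graph: my-graph.pt           # placeholder: optional pre-computed graph
diagnostic:
  log:
    mlflow:
      enabled: True
      experiment_name: regional-tutorial   # groups related runs
      run_name: my-first-regional-run      # uniquely describes this run
{% endhighlight %}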

### Transfer learning

@@ -56,8 +59,23 @@ model:
hidden2hidden: 0 # GNN and GraphTransformer Processor only
{% endhighlight %}

+Your first training run will not use a stretched grid, and will only use the ERA5 dataset. You need to change
+the dataloader section like this:
+{% highlight yaml %}
+dataloader:
+  dataset: ${hardware.paths.data}/${hardware.files.dataset}
+{% endhighlight %}
+
+Also, set `graphs: default` in the `defaults` section at the top (we don't want a stretched-grid graph) and
+remove `dataset_lam` under `hardware: files`.
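
Put together, the first-run changes might look like this sketch; surrounding entries in `defaults`
and `hardware: files` are assumed and omitted, and the dataset name is a placeholder.

{% highlight yaml %}
# Sketch of the three first-run changes; other entries omitted.
defaults:
  - graphs: default  # instead of a stretched-grid graph config

dataloader:
  dataset: ${hardware.paths.data}/${hardware.files.dataset}

hardware:
  files:
    dataset: global-era5.zarr  # placeholder; the dataset_lam entry is removed
{% endhighlight %}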

## Training the model

Finally we can train a regional model! Run this:

+{% highlight bash %}
+aifs-train --config-dir ./ --config-name config_regional.yaml
+{% endhighlight %}

+If you are running this in the [job script on LUMI]({{ 'getting-started-on-lumi' }}) that we looked at earlier,
+just replace `<command_to_run>` with the aifs-train command above.
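
For instance, the relevant line of the batch script might then read as below. This is a sketch:
whether the script launches the command with `srun` or runs it directly depends on the job script
from that page.

{% highlight bash %}
# Sketch: inside the LUMI job script, <command_to_run> is replaced with
# the training command (launcher details depend on the actual script).
srun aifs-train --config-dir ./ --config-name config_regional.yaml
{% endhighlight %}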
