Download ERA5 Data¶
Canari-ML provides a flexible configuration system using Hydra to download ERA5 reanalysis data. This guide explains how to use the canari_ml download command, including overriding default settings via CLI arguments or custom config files.
Default Configuration¶
The main command for downloading data is canari_ml download. To see the default options and which settings can be overridden, print the command's help text (for a Hydra application this is the --help flag):
download is powered by Hydra.
== Configuration groups ==
Compose your configuration from those groups (group=option)
callbacks: default, early_stopping, model_checkpoint
common: default
hydra_config: predict, train
logger: csv, tensorboard, wandb
model: default, unet
paths: default, download, plot, postprocess, predict, preprocess, train
plot: default, ua700
postprocess: default, netcdf, plot_ua700
predict: default
preprocess: default
profiler: pytorch
train: default
trainer: default
== Config ==
Override anything in the config (foo.bar=value)
frequency: DAY
output_group_by: YEAR
hemisphere: north
workers: 4
delete_cache: false
cache_only: false
overwrite_config: true
vars:
- zg
- ua
- va
- tos
- tas
- sic
levels:
- 2|10|50|100|250|500|700
- 2|10|50|100|250|500|700
- 2|10|50|100|250|500|700
- null
- null
- null
dates:
  start:
  - '1979-01-01'
  end:
  - '2024-12-31'
paths:
  download:
    source_data_dir: ${hydra:runtime.cwd}/data
    config_file: ${paths.download.source_data_dir}/data.aws.${frequency}.${hemisphere}.json
Powered by Hydra (https://hydra.cc)
Use --hydra-help to view Hydra specific help
The data comes from NSF NCAR, which hosts a mirror of the ERA5 dataset on AWS S3 (among other access methods). Downloads are performed with download-toolbox from the environmental-forecasting initiative.
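The config_file entry in the default configuration is built from other settings through interpolation. The sketch below only mimics that expansion with ordinary string formatting to show which file name results; it is not how Hydra resolves values internally, and the function name and directory are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_config_file(source_data_dir: str, frequency: str, hemisphere: str) -> PurePosixPath:
    """Mimic the expansion of paths.download.config_file (illustrative only)."""
    # Mirrors: ${paths.download.source_data_dir}/data.aws.${frequency}.${hemisphere}.json
    return PurePosixPath(source_data_dir) / f"data.aws.{frequency}.{hemisphere}.json"


# Default frequency and hemisphere, with a hypothetical data directory.
print(resolve_config_file("/work/project/data", "DAY", "north"))
# prints: /work/project/data/data.aws.DAY.north.json
```

Changing frequency or hemisphere therefore changes the name of the generated JSON config file as well.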
General config options¶
| Name | Type | Default Value | Description |
|---|---|---|---|
| frequency | str | DAY | The temporal resolution of the data to download. |
| output_group_by | str | YEAR | How output files are grouped (e.g., by YEAR or MONTH). |
| hemisphere | str | north | Which hemisphere to download data for (north or south). |
| workers | int | 4 | Number of parallel workers for downloading. |
| delete_cache | bool | false | Whether to delete cached files after processing. |
| cache_only | bool | false | Only download and cache data without processing. |
| overwrite_config | bool | true | Overwrite existing configuration files during setup. |
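Any option in the table can be overridden on the command line using the key=value form that the help text describes ("Override anything in the config (foo.bar=value)"). The following sketch assembles such a command from a Python dictionary; the helper name hydra_overrides is hypothetical, and the command is only printed, not executed.

```python
import shlex


def hydra_overrides(options: dict) -> list[str]:
    """Turn a flat {key: value} mapping into Hydra-style key=value overrides."""
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = str(value).lower()  # YAML-style booleans: true / false
        args.append(f"{key}={value}")
    return args


command = ["canari_ml", "download", *hydra_overrides({
    "hemisphere": "south",
    "workers": 8,
    "delete_cache": True,
})]
print(shlex.join(command))
# prints: canari_ml download hemisphere=south workers=8 delete_cache=true
```

Nested options use dotted keys in the same way, for example paths.download.source_data_dir=/some/dir.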
Variables¶
Variables available for download:
| CMIP6 variable name | Description | ECMWF ID | ECMWF Short Name | Dataset | Comments |
|---|---|---|---|---|---|
| hus | Specific humidity | 133 | q | pressure-level | Mass of water vapour per kilogram of moist air (kg kg-1) |
| ta | Air temperature | 130 | t | pressure-level | Temperature in the atmosphere (K) |
| ua | Zonal wind component | 131 | u | pressure-level | Eastward component of the wind (m/s) |
| va | Meridional wind component | 132 | v | pressure-level | Horizontal speed of air moving towards the north (m/s) |
| zg | Geopotential height | 129 | z | pressure-level | Downloaded as geopotential (z), then converted to geopotential height (zg) during the dataset preprocessing step |
| ps | Surface pressure | 134 | sp | surface-level | Pressure (force per unit area) of the atmosphere on the surface of land, sea and in-land water (Pa) |
| psl | Sea level pressure | 151 | msl | surface-level | Pressure (force per unit area) of the atmosphere adjusted to the height of mean sea level (Pa) |
| sic | Sea ice concentration | 262001 | ci | surface-level | Fraction of a grid box which is covered by sea ice (1) |
| tas | Near-surface air temperature | 167 | 2t | surface-level | Temperature of air at 2m above the surface of land, sea or in-land waters (K) |
| tos | Sea Surface Temperature | 34 | sstk | surface-level | Temperature of sea water near the surface (K) |
Note
The ERA5 AWS data mirror offers additional variables, but the variables above are the only ones currently mapped and downloadable through this interface.
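The zg row mentions that geopotential is converted to geopotential height during preprocessing. As a minimal sketch of that relationship, geopotential (in m² s⁻²) is conventionally divided by standard gravity to obtain a height in metres. The constant and function name below are assumptions for illustration; the actual preprocessing step may use a different implementation.

```python
# Standard gravity in m s^-2 (assumed here; the preprocessing code may differ).
STANDARD_GRAVITY = 9.80665


def geopotential_to_height(z: float) -> float:
    """Convert geopotential (m^2 s^-2) to geopotential height (m)."""
    return z / STANDARD_GRAVITY


# A geopotential of 49033.25 m^2 s^-2 corresponds to roughly 5000 m.
print(round(geopotential_to_height(49033.25), 1))
```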
Pressure Levels¶
Pressure levels for variables:
| Name | Type | Default Value | Description |
|---|---|---|---|
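| levels | list[str] | [2\|10\|50\|100\|250\|500\|700] | Pressure levels for the selected variables, written as one pipe-separated string per entry. In the default config, surface-level variables use null. |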
For more information on what pressure levels are available, check out the source data documentation.
Your options for pressure levels are:
[1|2|3|5|7|10|20|30|50|70|100|125|150|175|200|225|250|300|350|400|450|500|550|600|650|700|750|775|800|825|850|875|900|925|950|975|1000]
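Before starting a long download it can help to check a levels setting against the list above. The sketch below parses the pipe-separated strings and pairs them with variables by position. The positional pairing is an assumption inferred from the default config (six vars, six levels entries, null for surface variables), and both function names are hypothetical.

```python
AVAILABLE_LEVELS = {
    1, 2, 3, 5, 7, 10, 20, 30, 50, 70, 100, 125, 150, 175, 200, 225, 250,
    300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 775, 800, 825, 850,
    875, 900, 925, 950, 975, 1000,
}


def parse_levels(spec):
    """Parse a pipe-separated level string; None marks a surface-level variable."""
    if spec is None:
        return []
    levels = [int(part) for part in spec.split("|")]
    unknown = sorted(set(levels) - AVAILABLE_LEVELS)
    if unknown:
        raise ValueError(f"Unavailable pressure levels: {unknown}")
    return levels


def pair_vars_with_levels(variables, level_specs):
    """Pair each variable with its levels by position (assumed convention)."""
    if len(variables) != len(level_specs):
        raise ValueError("vars and levels must have the same number of entries")
    return {var: parse_levels(spec) for var, spec in zip(variables, level_specs)}


print(pair_vars_with_levels(["zg", "tas"], ["250|500", None]))
# prints: {'zg': [250, 500], 'tas': []}
```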
Dates¶
Configuring date ranges to download:
| Name | Type | Default Value | Description |
|---|---|---|---|
| dates.start | str | 1979-01-01 | Start date of data download. |
| dates.end | str | 2024-12-31 | End date of data download. |
This configuration structure allows you to customise your data downloading and processing workflow. Use these tables as a reference when setting up your config.yml file.
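As a quick sanity check on a date range, the sketch below validates the two ISO dates and lists the calendar years they span. Treating each year as one output group under output_group_by: YEAR is an assumption made for illustration, and the function name is hypothetical.

```python
from datetime import date


def years_spanned(start: str, end: str) -> list[int]:
    """Return the calendar years covered by an inclusive ISO date range."""
    start_date = date.fromisoformat(start)
    end_date = date.fromisoformat(end)
    if start_date > end_date:
        raise ValueError("dates.start must not be later than dates.end")
    return list(range(start_date.year, end_date.year + 1))


years = years_spanned("1979-01-01", "2024-12-31")
print(len(years), years[0], years[-1])
# prints: 46 1979 2024
```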
Summary¶
The canari_ml download command offers extensive flexibility through Hydra's configuration system. You can:
- Override defaults directly via CLI arguments.
- Use custom YAML config files to define complex overrides of the default behaviour.
- Combine CLI and config file overrides, with CLI taking precedence.
This allows you to tailor the download process for the variables you want to train with.
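The precedence described in the list above (defaults, then a custom config file, then CLI arguments) can be pictured as successive dictionary merges. This is only a sketch of the ordering using plain Python dictionaries; Hydra performs its own merging, and the deep_merge helper and example values are hypothetical.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


defaults = {
    "hemisphere": "north",
    "workers": 4,
    "dates": {"start": ["1979-01-01"], "end": ["2024-12-31"]},
}
from_file = {"workers": 8, "dates": {"start": ["2000-01-01"]}}  # custom YAML file
from_cli = {"workers": 2}  # CLI override, applied last

config = deep_merge(deep_merge(defaults, from_file), from_cli)
print(config["workers"], config["dates"])
# prints: 2 {'start': ['2000-01-01'], 'end': ['2024-12-31']}
```

The CLI value for workers wins over the file value, while the start date from the file still replaces the default and the untouched end date is kept.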
Next Steps¶
After downloading the data, the next step is to preprocess it so that it is ready for training.