Skip to content

Download ERA5 Data

Canari-ML provides a flexible configuration system using Hydra to download ERA5 reanalysis data. This guide explains how to use the canari_ml download command, including overriding default settings via CLI arguments or custom config files.

Default Configuration

The main command for downloading data is canari_ml download. To find the default options, and what configuration options can be changed, run:

$ canari_ml download --help
download is powered by Hydra.

== Configuration groups ==
Compose your configuration from those groups (group=option)

callbacks: default, early_stopping, model_checkpoint
common: default
hydra_config: predict, train
logger: csv, tensorboard, wandb
model: default, unet
paths: default, download, plot, postprocess, predict, preprocess, train
plot: default, ua700
postprocess: default, netcdf, plot_ua700
predict: default
preprocess: default
profiler: pytorch
train: default
trainer: default


== Config ==
Override anything in the config (foo.bar=value)

frequency: DAY
output_group_by: YEAR
hemisphere: north
workers: 4
delete_cache: false
cache_only: false
overwrite_config: true
vars:
- zg
- ua
- va
- tos
- tas
- sic
levels:
- 2|10|50|100|250|500|700
- 2|10|50|100|250|500|700
- 2|10|50|100|250|500|700
- null
- null
- null
dates:
  start:
  - '1979-01-01'
  end:
  - '2024-12-31'
paths:
  download:
    source_data_dir: ${hydra:runtime.cwd}/data
    config_file: ${paths.download.source_data_dir}/data.aws.${frequency}.${hemisphere}.json


Powered by Hydra (https://hydra.cc)
Use --hydra-help to view Hydra specific help

The source of this data is NSF NCAR which hosts a mirror of ERA5 dataset on AWS S3 (amongst other access methods). This uses download-toolbox from the environmental-forecasting initiative.

General config options

Name Type Default Value Description
frequency str DAY The temporal resolution of the data to download.
output_group_by str YEAR How output files are grouped (e.g., by YEAR or MONTH).
hemisphere str north Which hemisphere to download data for (north or south).
workers int 4 Number of parallel workers for downloading.
delete_cache bool false Whether to delete cached files after processing.
cache_only bool false Only download and cache data without processing.
overwrite_config bool true Overwrite existing configuration files during setup.

Variables

Variables available for download:

CMIP6 variable name Description ECMWF ID ECMWF Short Name Dataset Comments
hus Specific humidity 133 q pressure-level Mass of water vapour per kilogram of moist air (kg kg-1)
ta Air temperature 130 t pressure-level Temperature in the atmosphere (K)
ua Zonal wind component 131 u pressure-level Eastward component of the wind (m/s)
va Meridional wind component 132 v pressure-level Horizontal speed of air moving towards the north (m/s)
zg Geopotential height 129 z pressure-level This downloads geopotential {z}, which is converted during the dataset preprocess step to geopotential height {zg}
ps Surface pressure 134 sp surface-level Pressure (force per unit area) of the atmosphere on the surface of land, sea and in-land water (Pa)
psl Sea level pressure 151 msl surface-level Pressure (force per unit area) of the atmosphere adjusted to the height of mean sea level (Pa)
sic Sea ice concentration 262001 ci surface-level Fraction of a grid box which is covered by sea ice (1)
tas Near-surface air temperature 167 2t surface-level Temperature of air at 2m above the surface of land, sea or in-land waters (K)
tos Sea Surface Temperature 34 sstk surface-level Temperature of sea water near the surface (K)
Note

While there are additional variables that are available for download from the ERA5 AWS data mirror, the variables above are the only ones that have been mapped and downloadable through this interface currently.

Pressure Levels

Pressure levels for variables:

Name Type Default Value Description
levels list[str] [2|10|50|100|250|500|700] Pressure levels for the selected variables.

For more information on what pressure levels are available, check out the source data documentation.

Your options for pressure levels are:

[1|2|3|5|7|10|20|30|50|70|100|125|150|175|200|225|250|300|350|400|450|500|550|600|650|700|750|775|800|825|850|875|900|925|950|975|1000]

Dates

Configuring date ranges to download:

Name Type Default Value Description
dates.start str 1979-01-01 Start date of data download.
dates.end str 2024-12-31 End date of data download.

This configuration structure allows you to customise your data downloading and processing workflow. Use these tables as a reference when setting up your config.yml file.

Summary

The canari_ml download command offers extensive flexibility through Hydra's configuration system. You can:

  1. Override defaults directly via CLI arguments.
  2. Use custom YAML config files to define complex overrides of the default behaviour.
  3. Combine CLI and config file overrides, with CLI taking precedence.

This allows you to tailor the download process for the variables you want to train with.

Next Steps

After downloading the necessary data, you can proceed with preprocessing this data to get it ready for training.