Preprocessing Data

Overview

This section guides you through using the preprocessing command to generate the datasets needed for training and prediction. You will learn how to use the canari_ml preprocess command to prepare the source ERA5 data for ingestion into the ML model. As in previous sections, you can override the default settings via command-line arguments, YAML configuration files, or both.


Getting Started

Prerequisites

For a training dataset:

  • Follow all previous steps to download source data for all required variables, ready for preprocessing.
  • The downloaded data should extend before and after the dates used for training/prediction, because the model uses historical data (the amount is defined by lag_length) to predict future steps.
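
As a rough guide to how much extra data to download, the sketch below pads a date range by the lag and forecast lengths. It assumes a daily frequency, and that exactly lag_length days are needed before the first date and forecast_length days after the last. The helper name required_download_window is hypothetical and not part of canari_ml, which may pad the range differently.

```python
from datetime import date, timedelta


def required_download_window(start, end, lag_length, forecast_length):
    """Pad a date range so every sample has its history and its targets.

    Assumes one step per day: `lag_length` days of history before `start`
    and `forecast_length` days of targets after `end`.
    """
    if end < start:
        raise ValueError("end must not be before start")
    return (
        start - timedelta(days=lag_length),
        end + timedelta(days=forecast_length),
    )


# Default training date from the help output in this section, with the
# default forecast_length of 3 (lag_length defaults to the same value).
first, last = required_download_window(date(2000, 1, 5), date(2000, 1, 5), 3, 3)
print(first.isoformat(), last.isoformat())
```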

For a prediction dataset:

  • Ensure a training dataset has already been generated.
  • The normalisation parameters from this training dataset are used to normalise the prediction dataset.
  • Ensure you have a trained model to generate predictions with.
  • The trained model symlinks to the location of the training dataset used to create it, and the code follows this link to determine the normalisation parameters.
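
The idea of reusing training-time normalisation parameters can be illustrated with a small sketch. It assumes simple mean/standard-deviation scaling and a JSON file of parameters; the actual normalisation scheme and file layout used by canari_ml are not described in this section, and the function and file names here are hypothetical.

```python
import json
import statistics
import tempfile
from pathlib import Path


def fit_params(values):
    """Compute normalisation parameters from the training data only."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}


def normalise(values, params):
    """Scale values using previously fitted parameters."""
    return [(v - params["mean"]) / params["std"] for v in values]


train_tas = [270.0, 275.0, 280.0, 285.0, 290.0]  # made-up temperatures (K)
predict_tas = [280.0, 295.0]

with tempfile.TemporaryDirectory() as tmp:
    params_file = Path(tmp) / "normalisation.json"  # hypothetical file name
    # Training preprocessing: fit the parameters and save them.
    params_file.write_text(json.dumps({"tas": fit_params(train_tas)}))
    # Prediction preprocessing: load the saved parameters, do not refit.
    params = json.loads(params_file.read_text())["tas"]

normalised_predict = normalise(predict_tas, params)
print(normalised_predict)
```

Refitting on the prediction data instead would put the inputs on a different scale from the one the model saw during training.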

Usage

The canari_ml preprocess command preprocesses the ERA5 data by running the following steps:

  • Create train/val/predict data splits across input date ranges (either the defaults, or user-specified overrides).
  • Reproject the data from the source CRS (EPSG:4326) to the target CRS (EPSG:6931 by default; this is configurable).
  • Normalise the dataset to transform the range of variables to a standard scale. This ensures that each variable contributes proportionally to the model's learning process; otherwise, some variables would carry more or less weight simply because of their range.
  • If geopotential (z) is defined, convert it to geopotential height (zg).
  • Apply a hemisphere mask to mask out the region below 0° latitude (i.e., the Southern Hemisphere).
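
Two of the steps above are simple enough to sketch directly. The example below converts geopotential to geopotential height by dividing by standard gravity (9.80665 m/s²) and masks grid rows south of the equator. It uses plain Python lists on a tiny made-up regular latitude grid; the function names are illustrative and are not canari_ml's API, and using NaN as the masked value is an assumption.

```python
import math

STANDARD_GRAVITY = 9.80665  # m/s^2, conventional standard gravity


def geopotential_to_height(z):
    """Convert geopotential (m^2/s^2) to geopotential height (m)."""
    return z / STANDARD_GRAVITY


def mask_southern_hemisphere(field, latitudes):
    """Replace every row whose latitude is below 0 degrees with NaN."""
    return [
        [math.nan] * len(row) if lat < 0 else list(row)
        for row, lat in zip(field, latitudes)
    ]


latitudes = [45.0, 0.0, -45.0]  # one latitude per grid row
z = [
    [54000.0, 55000.0],  # made-up geopotential values
    [56000.0, 57000.0],
    [53000.0, 52000.0],
]

zg = [[geopotential_to_height(v) for v in row] for row in z]
masked = mask_southern_hemisphere(zg, latitudes)
print(masked)
```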

To see the subcommands available, run:

$ canari_ml preprocess --help
usage: canari_ml preprocess [-h] {train,predict} ...

positional arguments:
  {train,predict}

options:
  -h, --help       show this help message and exit

Training Subcommand

The train subcommand applies the preprocessing steps, then processes the normalised data to generate a training dataset in Zarr format and a corresponding JSON config file, ready for GPU training.

Basic Usage

$ canari_ml preprocess train --help
train_56480d80 is powered by Hydra.

== Configuration groups ==
Compose your configuration from those groups (group=option)

callbacks: default, early_stopping, model_checkpoint
common: default
hydra_config: predict, train
logger: csv, tensorboard, wandb
model: default, unet
paths: default, download, plot, postprocess, predict, preprocess, train
plot: default, ua700
postprocess: default, netcdf, plot_ua700
predict: default
preprocess: default
profiler: pytorch
train: default
trainer: default


== Config ==
Override anything in the config (foo.bar=value)

input:
  name: ''
  forecast_length: 3
  lag_length: ${input.forecast_length}
  vars:
    absolute:
    - sic
    - tas
    - tos
    - ua2
    - ua10
    - ua50
    - ua100
    - ua250
    - ua500
    - ua700
    anomaly:
    - zg2
    - zg10
    - zg50
    - zg100
    - zg250
    - zg500
    - zg700
  dates:
    train:
      start:
      - 2000-1-5
      end:
      - 2000-1-5
    val:
      start:
      - 2000-4-5
      end:
      - 2000-4-5
    test:
      start:
      - 2000-2-1
      end:
      - 2000-2-1
    predict:
      start:
      - 2000-2-1
      end:
      - 2000-2-1
reproject:
  source_crs: EPSG:4326
  target_crs: EPSG:6931
  shape: 500
preprocess_era5:
  implementation: canari_ml.data.processors.cds:ERA5PreProcessor
  smooth_sigma: 0.5
preprocess_mask:
  implementation: canari_ml.data.masks.era5:Masks
  channel_name: hemisphere
preprocess_cache:
  implementation: serial
  output_batch_size: 4
workers: 16
frequency: DAY
hemisphere: north
params:
  config_suffix: ${frequency}.${hemisphere}
  config_name: ${hydra:job.name}
hashes:
  reproject: ${compute_step_hash:[${input}, ${preprocess_type}, ${reproject}], ${input.name}}
  preprocess_era5: ${compute_step_hash:[${hashes.reproject}, ${preprocess_era5}],
    ${input.name}}
  preprocess_mask: ${compute_step_hash:[${hashes.preprocess_era5}, ${preprocess_mask}],
    ${input.name}}
  cache: ${compute_step_hash:[${hashes.preprocess_mask}, ${preprocess_cache}], ${input.name}}
  combined: ${compute_step_hash:[${hashes.preprocess_mask}], ${input.name}}
preprocess_type: train
source_dataset_id: era5
paths:
  download:
    source_data_dir: ${hydra:runtime.cwd}/data
    config_file: ${paths.download.source_data_dir}/data.aws.${frequency}.${hemisphere}.json
  preprocess:
    data_root: ${getcwd:}/preprocessed_data
    preprocessed_data_dir: ${paths.preprocess.data_root}/preprocessed
    reprojected_data_dir: ${paths.preprocess.preprocessed_data_dir}
    normalised_data_dir: ${paths.preprocess.preprocessed_data_dir}
    cache_dir: ${paths.preprocess.data_root}/cache
    loader_file: ${paths.preprocess.preprocessed_data_dir}/loader.${params.config_name}.json
  reproject:
    destination_path: ${paths.preprocess.reprojected_data_dir}/01_reproject${opt_underscore:${input.name}}${opt_underscore:${hashes.reproject}}
    config_file: ${paths.reproject.destination_path}/reproject.${params.config_suffix}.json
  preprocess_era5:
    destination_path: ${paths.preprocess.normalised_data_dir}/02_normalised${opt_underscore:${input.name}}${opt_underscore:${hashes.preprocess_era5}}
    config_file: ${paths.preprocess_era5.destination_path}/processed_era5.${params.config_suffix}.json
  mask:
    destination_path: ${paths.preprocess_era5.destination_path}/mask${opt_underscore:${input.name}}${opt_underscore:${hashes.preprocess_mask}}
    mask_dataset_config_path: ${paths.mask.destination_path}/dataset_config.masks.${params.config_suffix}.json
    mask_config_path: ${paths.mask.destination_path}/processed_era5.masks.${params.config_suffix}.json
  cache:
    destination_path: ${paths.preprocess.cache_dir}/03_cache${opt_underscore:${input.name}}${opt_underscore:${hashes.cache}}
    config_path: ${paths.cache.destination_path}/cached.${params.config_suffix}.json
  train: outputs/${train.name}/training/


Powered by Hydra (https://hydra.cc)
Use --hydra-help to view Hydra specific help

This displays the help menu with all default configuration options and shows which options are available to override.
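
The hashes block in this output feeds each step's hash into the next, so changing an upstream setting also changes the output directory names of every downstream step. The sketch below mimics that chaining with hashlib. How compute_step_hash is actually implemented in canari_ml is not shown in this section, so the hashing scheme, the 8-character truncation, and the function names step_hash and pipeline_hashes are all assumptions.

```python
import hashlib
import json


def step_hash(*parts):
    """Hash a step's inputs (hypothetical stand-in for compute_step_hash)."""
    payload = json.dumps(parts, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]


def pipeline_hashes(reproject_cfg, era5_cfg):
    """Chain hashes: each step depends on the previous step's hash."""
    reproject = step_hash(reproject_cfg)
    preprocess_era5 = step_hash(reproject, era5_cfg)
    return {"reproject": reproject, "preprocess_era5": preprocess_era5}


base = pipeline_hashes(
    {"target_crs": "EPSG:6931", "shape": 500}, {"smooth_sigma": 0.5}
)
changed = pipeline_hashes(
    {"target_crs": "EPSG:6931", "shape": 250}, {"smooth_sigma": 0.5}
)

# Changing the reprojection shape alters both hashes, even though the
# ERA5 settings are identical.
print(base)
print(changed)
```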

Running the Training Process

To execute the training preprocessing with the defaults:

$ canari_ml preprocess train
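
To run with non-default settings, overrides follow the key=value form noted in the help output (foo.bar=value). The sketch below builds such a command line from a dictionary without running it. The keys are taken from the config shown in the help output, while the chosen values and the helper name build_command are only illustrative.

```python
import shlex


def build_command(subcommand, overrides):
    """Build a canari_ml preprocess command with Hydra-style overrides."""
    args = ["canari_ml", "preprocess", subcommand]
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


command = build_command(
    "train",
    {
        "input.forecast_length": 5,  # lag_length follows this by default
        "reproject.shape": 250,
        "workers": 8,
    },
)
print(shlex.join(command))
```

Once canari_ml is installed, the resulting list can be passed to subprocess.run(command, check=True), or the printed line can be pasted into a shell.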

Prediction Subcommand

The predict subcommand applies the preprocessing steps, then outputs only the JSON config file. It does not generate a cached dataset, since caching would offer little performance benefit for prediction.

Basic Usage

$ canari_ml preprocess predict --help
predict_67baf29b is powered by Hydra.

== Configuration groups ==
Compose your configuration from those groups (group=option)

callbacks: default, early_stopping, model_checkpoint
common: default
hydra_config: predict, train
logger: csv, tensorboard, wandb
model: default, unet
paths: default, download, plot, postprocess, predict, preprocess, train
plot: default, ua700
postprocess: default, netcdf, plot_ua700
predict: default
preprocess: default
profiler: pytorch
train: default
trainer: default


== Config ==
Override anything in the config (foo.bar=value)

input:
  name: ''
  forecast_length: 3
  lag_length: ${input.forecast_length}
  vars:
    absolute:
    - sic
    - tas
    - tos
    - ua2
    - ua10
    - ua50
    - ua100
    - ua250
    - ua500
    - ua700
    anomaly:
    - zg2
    - zg10
    - zg50
    - zg100
    - zg250
    - zg500
    - zg700
  dates:
    train:
      start:
      - 2000-1-5
      end:
      - 2000-1-5
    val:
      start:
      - 2000-4-5
      end:
      - 2000-4-5
    test:
      start:
      - 2000-2-1
      end:
      - 2000-2-1
    predict:
      start:
      - 2000-2-1
      end:
      - 2000-2-1
reproject:
  source_crs: EPSG:4326
  target_crs: EPSG:6931
  shape: 500
preprocess_era5:
  implementation: canari_ml.data.processors.cds:ERA5PreProcessor
  smooth_sigma: 0.5
preprocess_mask:
  implementation: canari_ml.data.masks.era5:Masks
  channel_name: hemisphere
preprocess_cache:
  implementation: serial
  output_batch_size: 4
workers: 16
frequency: DAY
hemisphere: north
params:
  config_suffix: ${frequency}.${hemisphere}
  config_name: ${hydra:job.name}
hashes:
  reproject: ${compute_step_hash:[${input}, ${preprocess_type}, ${reproject}], ${input.name}}
  preprocess_era5: ${compute_step_hash:[${hashes.reproject}, ${preprocess_era5}],
    ${input.name}}
  preprocess_mask: ${compute_step_hash:[${hashes.preprocess_era5}, ${preprocess_mask}],
    ${input.name}}
  cache: ${compute_step_hash:[${hashes.preprocess_mask}, ${preprocess_cache}], ${input.name}}
  combined: ${compute_step_hash:[${hashes.preprocess_mask}], ${input.name}}
preprocess_type: predict
source_dataset_id: era5
paths:
  download:
    source_data_dir: ${hydra:runtime.cwd}/data
    config_file: ${paths.download.source_data_dir}/data.aws.${frequency}.${hemisphere}.json
  preprocess:
    data_root: ${getcwd:}/preprocessed_data
    preprocessed_data_dir: ${paths.preprocess.data_root}/preprocessed
    reprojected_data_dir: ${paths.preprocess.preprocessed_data_dir}
    normalised_data_dir: ${paths.preprocess.preprocessed_data_dir}
    cache_dir: ${paths.preprocess.data_root}/cache
    loader_file: ${paths.preprocess.preprocessed_data_dir}/loader.${params.config_name}.json
  reproject:
    destination_path: ${paths.preprocess.reprojected_data_dir}/01_reproject${opt_underscore:${input.name}}${opt_underscore:${hashes.reproject}}
    config_file: ${paths.reproject.destination_path}/reproject.${params.config_suffix}.json
  preprocess_era5:
    destination_path: ${paths.preprocess.normalised_data_dir}/02_normalised${opt_underscore:${input.name}}${opt_underscore:${hashes.preprocess_era5}}
    config_file: ${paths.preprocess_era5.destination_path}/processed_era5.${params.config_suffix}.json
  mask:
    destination_path: ${paths.preprocess_era5.destination_path}/mask${opt_underscore:${input.name}}${opt_underscore:${hashes.preprocess_mask}}
    mask_dataset_config_path: ${paths.mask.destination_path}/dataset_config.masks.${params.config_suffix}.json
    mask_config_path: ${paths.mask.destination_path}/processed_era5.masks.${params.config_suffix}.json
  cache:
    destination_path: ${paths.preprocess.cache_dir}/03_cache${opt_underscore:${input.name}}${opt_underscore:${hashes.cache}}
    config_path: ${paths.cache.destination_path}/cached.${params.config_suffix}.json
  train: outputs/${train.name}/training/


Powered by Hydra (https://hydra.cc)
Use --hydra-help to view Hydra specific help

This displays the help menu for the prediction command.

Running the Prediction Process

To execute the prediction preprocessing using the defaults:

$ canari_ml preprocess predict