Preprocessing Data
Overview
This section guides you through using the preprocessing command to generate the datasets needed for training and prediction. You will learn how to use the canari_ml preprocess command to preprocess the source ERA5 data so that it is ready for ingestion into the ML model. As in previous sections, you can override default settings via command-line arguments, YAML configuration files, or both.
Getting Started
Prerequisites
For a training dataset:
- Follow all previous steps to download the source data for every required variable, ready for preprocessing.
  - The download should also cover dates before and after the range used for training/prediction, because the model uses historical data (the amount is defined by lag_length) to predict future steps.
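As a rough illustration of how much extra source data this implies, the sketch below pads a date range by lag_length steps before the start and forecast_length steps after the end. It assumes daily data (matching the default frequency: DAY) and uses the default values shown in the help output later in this section. The helper name required_source_range is hypothetical and not part of canari_ml, and the exact padding the pipeline needs may differ by a step.

```python
from datetime import date, timedelta


def required_source_range(start, end, lag_length, forecast_length):
    """Return the (earliest, latest) dates of source data to have on disk.

    Hypothetical helper: assumes one step equals one day.
    """
    earliest = start - timedelta(days=lag_length)    # history fed to the model
    latest = end + timedelta(days=forecast_length)   # future steps being predicted
    return earliest, latest


# Default training date (2000-1-5) with forecast_length=3 and lag_length=3.
earliest, latest = required_source_range(
    date(2000, 1, 5), date(2000, 1, 5), lag_length=3, forecast_length=3
)
print(earliest, latest)  # 2000-01-02 2000-01-08
```

With the defaults, a single training date of 2000-01-05 would need source data from 2000-01-02 through 2000-01-08 under these assumptions.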
For a prediction dataset:
- Ensure a training dataset has already been generated.
  - The normalisation parameters from this training dataset are used to normalise the prediction dataset.
- A trained model that you want to generate predictions from.
  - The trained model symlinks to the location of the training dataset that was used to create it, and the code follows this link to look up the normalisation parameters.
Usage
The canari_ml preprocess command preprocesses the ERA5 data by running the following steps:
- Create train/val/predict data splits across the input date ranges (either the defaults or user-specified overrides).
- Reproject the data from the source CRS of EPSG:4326 to EPSG:6931 (the default target, which is configurable).
- Normalise the dataset to transform the range of each variable to a standard scale.
  - This ensures that each variable contributes proportionally to the model's learning process. Otherwise, certain variables may be weighted more or less heavily simply because of their range.
- If geopotential (z) is defined, convert it to geopotential height (zg).
- Apply a hemisphere mask to mask out the region below 0° latitude (i.e., the Southern Hemisphere).
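The numerical steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the canari_ml implementation: the text does not say which normalisation scheme is used, so zero-mean/unit-variance scaling is assumed here; the fill value for masked cells is also an assumption; and all function names are hypothetical. The geopotential conversion divides by standard gravity (9.80665 m/s²), which is the conventional definition of geopotential height.

```python
from statistics import mean, pstdev

G0 = 9.80665  # standard gravity in m/s^2


def standardise(values):
    """Scale values to zero mean and unit variance (assumed scheme)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant field: nothing to scale
    return [(v - mu) / sigma for v in values]


def geopotential_to_height(z):
    """Convert geopotential z (m^2/s^2) to geopotential height zg (m)."""
    return z / G0


def mask_southern(lats, values, fill=0.0):
    """Keep values at or above 0 degrees latitude; replace the rest with fill."""
    return [v if lat >= 0 else fill for lat, v in zip(lats, values)]


tas_scaled = standardise([250.0, 270.0, 290.0])
zg = geopotential_to_height(98066.5)
masked = mask_southern([-45.0, 0.0, 45.0], [1.0, 2.0, 3.0])
print(tas_scaled, zg, masked)
```

After scaling, a temperature field in kelvin and a wind field in m/s occupy comparable ranges, which is the point of the normalisation step.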
To see the subcommands available, run:
Training Subcommand
The train subcommand applies the preprocessing steps, then processes the normalised data to generate a training dataset in Zarr format and a corresponding JSON config file, ready for GPU training.
Basic Usage
train_56480d80 is powered by Hydra.
== Configuration groups ==
Compose your configuration from those groups (group=option)
callbacks: default, early_stopping, model_checkpoint
common: default
hydra_config: predict, train
logger: csv, tensorboard, wandb
model: default, unet
paths: default, download, plot, postprocess, predict, preprocess, train
plot: default, ua700
postprocess: default, netcdf, plot_ua700
predict: default
preprocess: default
profiler: pytorch
train: default
trainer: default
== Config ==
Override anything in the config (foo.bar=value)
input:
name: ''
forecast_length: 3
lag_length: ${input.forecast_length}
vars:
absolute:
- sic
- tas
- tos
- ua2
- ua10
- ua50
- ua100
- ua250
- ua500
- ua700
anomaly:
- zg2
- zg10
- zg50
- zg100
- zg250
- zg500
- zg700
dates:
train:
start:
- 2000-1-5
end:
- 2000-1-5
val:
start:
- 2000-4-5
end:
- 2000-4-5
test:
start:
- 2000-2-1
end:
- 2000-2-1
predict:
start:
- 2000-2-1
end:
- 2000-2-1
reproject:
source_crs: EPSG:4326
target_crs: EPSG:6931
shape: 500
preprocess_era5:
implementation: canari_ml.data.processors.cds:ERA5PreProcessor
smooth_sigma: 0.5
preprocess_mask:
implementation: canari_ml.data.masks.era5:Masks
channel_name: hemisphere
preprocess_cache:
implementation: serial
output_batch_size: 4
workers: 16
frequency: DAY
hemisphere: north
params:
config_suffix: ${frequency}.${hemisphere}
config_name: ${hydra:job.name}
hashes:
reproject: ${compute_step_hash:[${input}, ${preprocess_type}, ${reproject}], ${input.name}}
preprocess_era5: ${compute_step_hash:[${hashes.reproject}, ${preprocess_era5}],
${input.name}}
preprocess_mask: ${compute_step_hash:[${hashes.preprocess_era5}, ${preprocess_mask}],
${input.name}}
cache: ${compute_step_hash:[${hashes.preprocess_mask}, ${preprocess_cache}], ${input.name}}
combined: ${compute_step_hash:[${hashes.preprocess_mask}], ${input.name}}
preprocess_type: train
source_dataset_id: era5
paths:
download:
source_data_dir: ${hydra:runtime.cwd}/data
config_file: ${paths.download.source_data_dir}/data.aws.${frequency}.${hemisphere}.json
preprocess:
data_root: ${getcwd:}/preprocessed_data
preprocessed_data_dir: ${paths.preprocess.data_root}/preprocessed
reprojected_data_dir: ${paths.preprocess.preprocessed_data_dir}
normalised_data_dir: ${paths.preprocess.preprocessed_data_dir}
cache_dir: ${paths.preprocess.data_root}/cache
loader_file: ${paths.preprocess.preprocessed_data_dir}/loader.${params.config_name}.json
reproject:
destination_path: ${paths.preprocess.reprojected_data_dir}/01_reproject${opt_underscore:${input.name}}${opt_underscore:${hashes.reproject}}
config_file: ${paths.reproject.destination_path}/reproject.${params.config_suffix}.json
preprocess_era5:
destination_path: ${paths.preprocess.normalised_data_dir}/02_normalised${opt_underscore:${input.name}}${opt_underscore:${hashes.preprocess_era5}}
config_file: ${paths.preprocess_era5.destination_path}/processed_era5.${params.config_suffix}.json
mask:
destination_path: ${paths.preprocess_era5.destination_path}/mask${opt_underscore:${input.name}}${opt_underscore:${hashes.preprocess_mask}}
mask_dataset_config_path: ${paths.mask.destination_path}/dataset_config.masks.${params.config_suffix}.json
mask_config_path: ${paths.mask.destination_path}/processed_era5.masks.${params.config_suffix}.json
cache:
destination_path: ${paths.preprocess.cache_dir}/03_cache${opt_underscore:${input.name}}${opt_underscore:${hashes.cache}}
config_path: ${paths.cache.destination_path}/cached.${params.config_suffix}.json
train: outputs/${train.name}/training/
Powered by Hydra (https://hydra.cc)
Use --hydra-help to view Hydra specific help
This displays the help menu with all default configuration options and shows which options you can override.
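The help output describes overrides in the form foo.bar=value, where the dotted key walks down the nested configuration. The sketch below mimics that idea on a plain dictionary to show which entry an override such as input.forecast_length=7 would touch. It is a simplified illustration, not Hydra's implementation: the helpers apply_override and parse_value are hypothetical, and Hydra's real override grammar handles many more cases (lists, quoting, adding and deleting keys). Note also that, in the defaults above, lag_length is defined as ${input.forecast_length}, so changing forecast_length changes lag_length too unless lag_length is overridden separately.

```python
import copy


def parse_value(raw):
    """Turn '7' into 7; leave non-numeric text as a string."""
    try:
        return int(raw)
    except ValueError:
        return raw


def apply_override(config, override):
    """Return a copy of config with one 'dotted.key=value' override applied."""
    dotted_key, sep, raw = override.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {override!r}")
    *parents, leaf = dotted_key.split(".")
    updated = copy.deepcopy(config)   # leave the defaults untouched
    node = updated
    for key in parents:
        node = node[key]              # KeyError if a parent key is unknown
    if leaf not in node:
        raise KeyError(dotted_key)    # only existing options may be overridden here
    node[leaf] = parse_value(raw)
    return updated


defaults = {"input": {"forecast_length": 3}, "hemisphere": "north"}
cfg = apply_override(defaults, "input.forecast_length=7")
cfg = apply_override(cfg, "hemisphere=south")
print(cfg)  # {'input': {'forecast_length': 7}, 'hemisphere': 'south'}
```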
Running the Training Process
To execute the training preprocessing with the defaults:
Prediction Subcommand
The predict subcommand applies the same preprocessing steps but outputs only the JSON config file. It does not generate a cached dataset, because caching offers little performance benefit for prediction.
Basic Usage
predict_67baf29b is powered by Hydra.
== Configuration groups ==
Compose your configuration from those groups (group=option)
callbacks: default, early_stopping, model_checkpoint
common: default
hydra_config: predict, train
logger: csv, tensorboard, wandb
model: default, unet
paths: default, download, plot, postprocess, predict, preprocess, train
plot: default, ua700
postprocess: default, netcdf, plot_ua700
predict: default
preprocess: default
profiler: pytorch
train: default
trainer: default
== Config ==
Override anything in the config (foo.bar=value)
input:
name: ''
forecast_length: 3
lag_length: ${input.forecast_length}
vars:
absolute:
- sic
- tas
- tos
- ua2
- ua10
- ua50
- ua100
- ua250
- ua500
- ua700
anomaly:
- zg2
- zg10
- zg50
- zg100
- zg250
- zg500
- zg700
dates:
train:
start:
- 2000-1-5
end:
- 2000-1-5
val:
start:
- 2000-4-5
end:
- 2000-4-5
test:
start:
- 2000-2-1
end:
- 2000-2-1
predict:
start:
- 2000-2-1
end:
- 2000-2-1
reproject:
source_crs: EPSG:4326
target_crs: EPSG:6931
shape: 500
preprocess_era5:
implementation: canari_ml.data.processors.cds:ERA5PreProcessor
smooth_sigma: 0.5
preprocess_mask:
implementation: canari_ml.data.masks.era5:Masks
channel_name: hemisphere
preprocess_cache:
implementation: serial
output_batch_size: 4
workers: 16
frequency: DAY
hemisphere: north
params:
config_suffix: ${frequency}.${hemisphere}
config_name: ${hydra:job.name}
hashes:
reproject: ${compute_step_hash:[${input}, ${preprocess_type}, ${reproject}], ${input.name}}
preprocess_era5: ${compute_step_hash:[${hashes.reproject}, ${preprocess_era5}],
${input.name}}
preprocess_mask: ${compute_step_hash:[${hashes.preprocess_era5}, ${preprocess_mask}],
${input.name}}
cache: ${compute_step_hash:[${hashes.preprocess_mask}, ${preprocess_cache}], ${input.name}}
combined: ${compute_step_hash:[${hashes.preprocess_mask}], ${input.name}}
preprocess_type: predict
source_dataset_id: era5
paths:
download:
source_data_dir: ${hydra:runtime.cwd}/data
config_file: ${paths.download.source_data_dir}/data.aws.${frequency}.${hemisphere}.json
preprocess:
data_root: ${getcwd:}/preprocessed_data
preprocessed_data_dir: ${paths.preprocess.data_root}/preprocessed
reprojected_data_dir: ${paths.preprocess.preprocessed_data_dir}
normalised_data_dir: ${paths.preprocess.preprocessed_data_dir}
cache_dir: ${paths.preprocess.data_root}/cache
loader_file: ${paths.preprocess.preprocessed_data_dir}/loader.${params.config_name}.json
reproject:
destination_path: ${paths.preprocess.reprojected_data_dir}/01_reproject${opt_underscore:${input.name}}${opt_underscore:${hashes.reproject}}
config_file: ${paths.reproject.destination_path}/reproject.${params.config_suffix}.json
preprocess_era5:
destination_path: ${paths.preprocess.normalised_data_dir}/02_normalised${opt_underscore:${input.name}}${opt_underscore:${hashes.preprocess_era5}}
config_file: ${paths.preprocess_era5.destination_path}/processed_era5.${params.config_suffix}.json
mask:
destination_path: ${paths.preprocess_era5.destination_path}/mask${opt_underscore:${input.name}}${opt_underscore:${hashes.preprocess_mask}}
mask_dataset_config_path: ${paths.mask.destination_path}/dataset_config.masks.${params.config_suffix}.json
mask_config_path: ${paths.mask.destination_path}/processed_era5.masks.${params.config_suffix}.json
cache:
destination_path: ${paths.preprocess.cache_dir}/03_cache${opt_underscore:${input.name}}${opt_underscore:${hashes.cache}}
config_path: ${paths.cache.destination_path}/cached.${params.config_suffix}.json
train: outputs/${train.name}/training/
Powered by Hydra (https://hydra.cc)
Use --hydra-help to view Hydra specific help
This displays the help menu for the prediction command.
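In the hashes block of the help output, each step's hash takes the previous step's hash as an input, and the resulting hashes appear in the destination_path entries. The first hash also includes preprocess_type, which is predict here and train for the training subcommand. The sketch below illustrates why such chaining is useful: changing one step's settings changes the hash of that step and of every later step, while earlier steps keep their hash. How compute_step_hash actually works is not shown in this document, so the SHA-256-over-JSON approach, the 8-character truncation, and the function names below are all assumptions made for illustration.

```python
import hashlib
import json


def step_hash(parts, length=8):
    """Hash a list of config fragments into a short hex string (assumed scheme)."""
    payload = json.dumps(parts, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]


def chained_hashes(input_cfg, preprocess_type, reproject_cfg, era5_cfg):
    """Mirror the dependency order seen in the config: reproject, then preprocess_era5."""
    h_reproject = step_hash([input_cfg, preprocess_type, reproject_cfg])
    h_era5 = step_hash([h_reproject, era5_cfg])  # depends on the earlier hash
    return h_reproject, h_era5


input_cfg = {"forecast_length": 3}
reproject_cfg = {"target_crs": "EPSG:6931", "shape": 500}

base = chained_hashes(input_cfg, "predict", reproject_cfg, {"smooth_sigma": 0.5})
new_sigma = chained_hashes(input_cfg, "predict", reproject_cfg, {"smooth_sigma": 1.0})
as_train = chained_hashes(input_cfg, "train", reproject_cfg, {"smooth_sigma": 0.5})

print(base[0] == new_sigma[0])  # True: the reproject step is unaffected
print(base[1] == new_sigma[1])  # False: the normalisation step gets a new hash
print(base[0] == as_train[0])   # False: train and predict differ from the first step
```

Under a scheme like this, outputs from different settings land in different directories, so a later run with the same settings can find the earlier results.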
Running the Prediction Process
To execute the prediction preprocessing using the defaults: