SLURM Submission¶
Running jobs on clusters is relatively straightforward.
Preprocessing¶
As an example, for preprocessing:
canari_ml preprocess train 'hydra.searchpath=[file://.]' hydra/launcher=slurm_jasmin_cpu -cd configs/preprocess/ -cn train_demo_dataset -m
- The
-mflag is used to submit the job to SLURM. 'hydra.searchpath=[file://.]'is used to define where additional config files should be found. This lets you define config paths relative to your current location. In this case,hydra/launcher=slurm_jasmin_cpu.yaml.
where, this custom config file is used to define the job submission on a SLURM cluster. As an example, for the JASMIN LOTUS2 cluster, you would use the following config file:
hydra/launcher/slurm_jasmin_cpu.yaml
# @package hydra.launcher
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
submitit_folder: ${hydra.sweep.dir}/${hydra.sweep.subdir}/slurm/%j
timeout_min: 1440 #(1)!
cpus_per_task: 1
tasks_per_node: 1
mem_gb: 256
nodes: 1
name: ${hydra.job.name}
partition: standard
qos: high
comment: null
mem_per_cpu: null
account: canari
max_num_timeout: 0
setup: #(2)!
- echo "Hello World"
- Duration to run for in minutes, in this case, 24 hours.
- List of set-up commands to run before the job starts.
Training¶
And, for submitting a training run on a GPU partition:
canari_ml train 'hydra.searchpath=[file://.]' hydra/launcher=slurm_jasmin_cpu train.dataset=preprocessed_data/train_demo_dataset/03_cache_demo_dataset/cached.DAY.north.json train.name=demo_train train.epochs=2
hydra/launcher/slurm_jasmin_gpu.yaml
# @package hydra.launcher
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
submitit_folder: ${hydra.sweep.dir}/${hydra.sweep.subdir}/slurm/%j
timeout_min: 1440
cpus_per_task: 4
tasks_per_node: 1
mem_gb: 256
nodes: 1
name: ${hydra.job.name}
partition: orchid
qos: orchid
comment: null
gres: "gpu:1" # (1)!
mem_per_cpu: null
account: canari
max_num_timeout: 0
setup:
- echo "Hello World"
- This defines how many GPU resources you are requesting for training.