Configuration Parameters
This page describes the configuration parameters used in the YAML file for controlling the training and evaluation of the model.
This is what a standard configuration looks like:
```yaml
# lightning.pytorch==2.1.2
seed_everything: true
trainer:
  accelerator: gpu
  strategy: auto
  devices: auto
  num_nodes: 1
  precision: null
  logger:
    class_path: lightning.pytorch.loggers.WandbLogger
    init_args:
      name: null
      save_dir: null
      version: null
      offline: false
      dir: null
      id: null
      anonymous: null
      project: project-unsat
      log_model: false
      experiment: null
      prefix: ''
      checkpoint_name: null
      job_type: null
      config: null
      entity: null
      reinit: null
      tags: null
      group: null
      notes: null
      magic: null
      config_exclude_keys: null
      config_include_keys: null
      mode: null
      allow_val_change: null
      resume: null
      force: null
      tensorboard: null
      sync_tensorboard: null
      monitor_gym: null
      save_code: null
      settings: null
  callbacks:
    - unsat.callbacks.ClassWeightsCallback
    - class_path: unsat.callbacks.CheckFaultsCallback
      init_args:
        patch_size: 64
  fast_dev_run: false
  max_epochs: 1000
  min_epochs: null
  max_steps: -1
  min_steps: null
  max_time: null
  limit_train_batches: null
  limit_val_batches: null
  limit_test_batches: null
  limit_predict_batches: null
  overfit_batches: 0.0
  val_check_interval: null
  check_val_every_n_epoch: 1
  num_sanity_val_steps: null
  log_every_n_steps: 1
  enable_checkpointing: null
  enable_progress_bar: null
  enable_model_summary: null
  accumulate_grad_batches: 1
  gradient_clip_val: null
  gradient_clip_algorithm: null
  deterministic: null
  benchmark: null
  inference_mode: true
  use_distributed_sampler: true
  profiler: null
  detect_anomaly: false
  barebones: false
  plugins: null
  sync_batchnorm: false
  reload_dataloaders_every_n_epochs: 0
  default_root_dir: null
model:
  network:
    class_path: unsat.models.UNet
    init_args:
      start_channels: 2
      num_blocks: 3
      kernel_size: 3
      block_depth: 2
      batch_norm: true
  optimizer:
    class_path: torch.optim.Adam
    init_args:
      lr: 3e-3
data:
  hdf5_path: /projects/0/einf3381/UNSAT/data/experimental.h5
  faults_path: faults/faults.yaml
  class_names:
    - "water"
    - "background"
    - "air"
    - "root"
    - "soil"
  input_channels: 1
  train_samples:
    - maize/coarse/loose
    - maize/fine/dense
  height_range:
    - 1000
    - 1100
  train_day_range:
    - 2
    - 3
  validation_split: 0.1
  seed: 42
  batch_size: 4
  num_workers: 2
  dimension: 2
  patch_size: 512
  patch_border: 16
ckpt_path: null
```
Explanation of Configuration Parameters
- seed_everything: Seeds all random number generators for reproducibility (`true` or `false`).
Trainer Configuration
- trainer.accelerator: Specifies the hardware to use for training (`gpu`, `cpu`, etc.).
- trainer.strategy: Distributed training strategy; `auto` lets Lightning pick one.
- trainer.devices: Number of devices to use; `auto` uses all available devices.
- trainer.num_nodes: Specifies the number of nodes for distributed training.
- trainer.precision: Defines the floating-point precision; `null` uses the default (32-bit). A hardware example follows this list.
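For instance, a run on two GPUs with 16-bit mixed precision could override these keys as follows (a sketch; the device count and precision here are illustrative, not the defaults above):

```yaml
trainer:
  accelerator: gpu
  devices: 2            # two GPUs instead of `auto` (all available)
  num_nodes: 1
  precision: 16-mixed   # Lightning 2.x mixed-precision setting
```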
Logger Settings
- trainer.logger.class_path: Specifies the logger to use. This setup uses the `WandbLogger`.
- trainer.logger.init_args: Arguments for initializing the logger; a minimal example follows this list.
- name: Name for the run; `null` auto-generates one.
- save_dir: Directory to save logs.
- version: Version number for the logger.
- offline: Enables offline mode for logging.
- dir: Directory for logs.
- id: ID for resuming a run.
- anonymous: Logs anonymously if set to true.
- project: Project name, set here to `project-unsat`.
- log_model: Indicates whether to log model checkpoints.
- experiment: A pre-created W&B run object to attach to instead of creating a new one.
- prefix: String prepended to metric keys.
- checkpoint_name: Name of the checkpoint.
- job_type: Specifies the job type (e.g., training, validation).
- config: Logs a configuration dictionary.
- entity: W&B entity or team name.
- reinit: Allows re-initialization if set to true.
- tags: Tags for the run.
- group: Group name for organizing runs.
- notes: Additional notes about the run.
- magic: Enables W&B's automatic ("magic") instrumentation of the training script.
- config_exclude_keys: Excludes specific keys from the config logging.
- config_include_keys: Includes specific keys in the config logging.
- mode: Logging mode (`online`, `offline`, or `disabled`).
- allow_val_change: Allows config values to be changed after they have been set if true.
- resume: Controls whether and how a previous run is resumed.
- force: If true, requires a W&B login and fails when none is available.
- tensorboard: Configures TensorBoard integration.
- sync_tensorboard: Synchronizes TensorBoard with the logger.
- monitor_gym: Logs videos from OpenAI Gym environments if true.
- save_code: Saves the code related to the run if true.
- settings: Additional logger settings.
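Most of these arguments can remain `null`. A minimal sketch that names and tags a run, where the run name and tags are hypothetical:

```yaml
trainer:
  logger:
    class_path: lightning.pytorch.loggers.WandbLogger
    init_args:
      project: project-unsat
      name: unet-baseline    # hypothetical run name; null auto-generates one
      tags: [maize, unet]    # hypothetical tags for filtering runs in the W&B UI
      offline: true          # log locally now, sync to W&B later
```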
Callbacks
- trainer.callbacks: A list of callbacks used during training; an extended example follows this list.
- unsat.callbacks.ClassWeightsCallback: Adjusts class weights dynamically.
- unsat.callbacks.CheckFaultsCallback: Monitors for faults during training.
- init_args.patch_size: Patch size for fault checking, set to 64.
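Callbacks are given either as a bare class path or as a `class_path`/`init_args` pair. As a sketch, Lightning's built-in `ModelCheckpoint` could be appended to the list; this assumes the model logs a metric named `val_loss`, which is not confirmed by the config above:

```yaml
trainer:
  callbacks:
    - unsat.callbacks.ClassWeightsCallback
    - class_path: unsat.callbacks.CheckFaultsCallback
      init_args:
        patch_size: 64
    - class_path: lightning.pytorch.callbacks.ModelCheckpoint
      init_args:
        monitor: val_loss   # assumed metric name
        save_top_k: 1       # keep only the best checkpoint
```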
Additional Trainer Settings
- trainer.fast_dev_run: Runs a single batch for debugging if true.
- trainer.max_epochs: Maximum number of epochs, set to 1000.
- trainer.min_epochs: Minimum number of epochs; `null` means no minimum.
- trainer.max_steps: Maximum number of training steps; -1 disables the limit.
- trainer.min_steps: Minimum number of steps; `null` means no minimum.
- trainer.max_time: Limits the maximum training time.
- trainer.limit_train_batches: Limits the number of training batches per epoch.
- trainer.limit_val_batches: Limits the number of validation batches per epoch.
- trainer.limit_test_batches: Limits the number of test batches.
- trainer.limit_predict_batches: Limits the number of prediction batches.
- trainer.overfit_batches: Fraction of the training data to repeatedly overfit for debugging; 0.0 disables it.
- trainer.val_check_interval: How often to run validation within a training epoch, as a fraction of the epoch or a number of batches; `null` uses the default. A debugging example follows this list.
- trainer.check_val_every_n_epoch: How often to perform validation checks, in epochs.
- trainer.num_sanity_val_steps: Number of steps for sanity check validation.
- trainer.log_every_n_steps: Frequency of logging, set to 1 for every step.
- trainer.enable_checkpointing: Enables checkpointing.
- trainer.enable_progress_bar: Shows a progress bar if true.
- trainer.enable_model_summary: Displays a summary of the model if true.
- trainer.accumulate_grad_batches: Number of batches over which gradients are accumulated, set to 1.
- trainer.gradient_clip_val: Clipping value for gradients.
- trainer.gradient_clip_algorithm: Algorithm used for gradient clipping.
- trainer.deterministic: Ensures deterministic training if true.
- trainer.benchmark: Enables cuDNN benchmark mode, which can improve performance when input shapes do not vary.
- trainer.inference_mode: Runs evaluation under `torch.inference_mode()` rather than `torch.no_grad()`.
- trainer.use_distributed_sampler: Uses a distributed sampler for data loading.
- trainer.profiler: Profiler for performance analysis.
- trainer.detect_anomaly: Enables anomaly detection if true.
- trainer.barebones: If true, runs with minimal features.
- trainer.plugins: Specifies additional plugins to use.
- trainer.sync_batchnorm: Synchronizes batch normalization across devices if true.
- trainer.reload_dataloaders_every_n_epochs: Reloads data loaders after a specified number of epochs.
- trainer.default_root_dir: Root directory for saving logs and checkpoints.
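For quick debugging runs, several of these settings combine naturally. A sketch with illustrative values:

```yaml
trainer:
  max_epochs: 2
  limit_train_batches: 10   # only 10 training batches per epoch
  limit_val_batches: 5      # only 5 validation batches per check
  detect_anomaly: true      # surface NaN/Inf during the backward pass
  log_every_n_steps: 1
```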
Model Configuration
- model.network.class_path: Path to the model class, set to `unsat.models.UNet`.
- model.network.init_args: Initialization arguments for the model; an example variant follows this list.
- start_channels: Number of starting channels, set to 2.
- num_blocks: Number of blocks in the model, set to 3.
- kernel_size: Size of the convolution kernels, set to 3.
- block_depth: Depth of each block, set to 2.
- batch_norm: Enables batch normalization if true.
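Assuming these arguments behave as their names suggest, a larger UNet variant might look like this (values illustrative):

```yaml
model:
  network:
    class_path: unsat.models.UNet
    init_args:
      start_channels: 4   # wider first block than the default of 2
      num_blocks: 4       # presumably one extra down/up-sampling level
      kernel_size: 3
      block_depth: 2
      batch_norm: true
```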
Optimizer Configuration
- model.optimizer.class_path: Path to the optimizer class, set to `torch.optim.Adam`.
- model.optimizer.init_args: Initialization arguments for the optimizer; a swap-in example follows this list.
- lr: Learning rate, set to 3e-3.
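Because the optimizer is specified by class path, other `torch.optim` optimizers should be drop-in replacements. A sketch using `AdamW` with weight decay (values illustrative):

```yaml
model:
  optimizer:
    class_path: torch.optim.AdamW
    init_args:
      lr: 1.0e-3
      weight_decay: 0.01   # decoupled weight decay, the point of AdamW
```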
Data Configuration
- data.hdf5_path: Path to the HDF5 file containing the dataset.
- data.faults_path: Path to the YAML file specifying faults.
- data.class_names: List of class names for classification:
- water, background, air, root, soil
- data.input_channels: Number of input channels, set to 1.
- data.train_samples: List of paths to training samples:
- maize/coarse/loose, maize/fine/dense
- data.height_range: Range of heights to consider, from 1000 to 1100.
- data.train_day_range: Days to include in the training set, from 2 to 3.
- data.validation_split: Fraction of the data to use for validation, set to 0.1.
- data.seed: Random seed for data shuffling, set to 42.
- data.batch_size: Batch size, set to 4.
- data.num_workers: Number of workers for data loading, set to 2.
- data.dimension: Dimensionality of the data, set to 2.
- data.patch_size: Patch size for data extraction, set to 512.
- data.patch_border: Border size around each patch, set to 16. A lower-memory variant of this block is sketched below.
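As a sketch, a lower-memory variant of the data block. This shows only the keys being changed and assumes partial configs are merged onto the full one (Lightning's CLI accepts multiple `--config` files):

```yaml
data:
  batch_size: 2     # halve the batch size
  patch_size: 256   # smaller patches reduce GPU memory use
  num_workers: 4    # more loader workers if CPU cores are available
```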
Checkpoint Path
- ckpt_path: Path to a checkpoint for resuming training, set to `null` to start fresh.
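For example, to resume training (the checkpoint path is hypothetical):

```yaml
ckpt_path: logs/checkpoints/last.ckpt   # hypothetical path; restores weights, optimizer, and epoch state
```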