Train

Training Aframe networks using PyTorch Lightning and hyper-parameter tuning using Ray Tune

Environment

The train project environment is manged by poetry.

In the root of the train project, run

apptainer build $AFRAME_CONTAINER_ROOT/train.sif apptainer.def

to build the train container.

This project can also be installed locally via

poetry install

Scripts

The train project consists of two main executables

train - launch a single training job train.tune - launch distributed hyper-parameter tuning using Ray

Train

The training script takes advantage of LightningCLI allowing for modularity and flexibility. One single training script supports

  • Time domain and Frequency domain data representations

  • Supervised and Semi-supervised optimization schemes

all by changing a configuration file. This is achieved by using a class hierarchy of DataModules and LightningModules where core functionality common to all use-cases is abstracted into base classes.

To see a list of arguments one can locally run

poetry run python -m train --help

or inside the container

apptainer run $AFRAME_CONTAINER_ROOT/train.sif python -m train --help

This list is quite exhaustive. It is suggested that you start from the default configuration file.

Example: Training Aframe

Note It is assumed you have generated a training dataset via the data project example

The following will a training run using GPU 0

mkdir ~/aframe/results
APPTAINERENV_CUDA_VISIBLE_DEVICES=0 apptainer run --nv $AFRAME_CONTAINER_ROOT/train.sif \
    python -m train \
        --config /opt/aframe/projects/train/config.yaml \
        --data.ifos=[H1,L1] \
        --data.data_dir ~/aframe/data/train \
        --trainer.logger=WandbLogger \
        --trainer.logger.project=aframe \
        --trainer.logger.name=my-first-run \
        --trainer.logger.save_dir=~/aframe/results/my-first-run

This will infer most of your training arguments from the YAML config that got put into the container at build time. If you want to change this config, or if you change any code and you want to see those changes reflected inside the container, you can “bind” your local version of the root Aframe repository into the container by including apptainer run --bind .:/opt/aframe at the beginning of the above command.

You can even train using multiple GPUS for free! Just specify a list of comma-separated GPU indices to APPTAINERENV_CUDA_VISIBLE_DEVICES.

Weights & Biases (WandB)

Aframe uses WandB for experiment tracking. WandB already has built-in integration with lightning.

You can assign various attributes to your W&B logger in the luigi.cfg file present in aframe’s root directory

  • name: name the run will be assigned

  • group: group to which the run will be assigned. This is useful for runs that are part of the same experiment but execute in different scripts, e.g. a hyperparameter sweep or maybe separate train, inferenence, and evaluation scripts

  • tags: comma-separated list of tags to give your run. Makes it easy to filter in the dashboard e.g. for autoencoder runs

  • project: the workspace consisting of multiple related experiments that your run is a part of, by default this is set to aframe

  • entity: the group managing the experiments your run is associated, e.g. ml4gw. By default, this is left blank, resulting in the project and run being associated with your personal account

Note All the attributes above can also be configured via environment variables

Once your run is started, you can go to wandb.ai and track your loss and validation score. If you don’t want to track your run with W&B, just remove the first three --trainer arguments above. This will save your training metrics to a local CSV in the save_dir.

Tune

In addition, the train project consists of a tuning script for performing a distributed hyper-parameter search with Ray Tune. It is recommended that multiple GPU’s are available for an efficient search.

A local tune job can be launched with

APPTAINERENV_CUDA_VISIBLE_DEVICES=<IDs of GPUs to tune on> apptainer run --nv $AFRAME_CONTAINER_ROOT/train.sif \
    python -m train.tune \
        --config /opt/aframe/projects/train/config.yaml
        --data.ifos=[H1,L1]
        --data.data_dir ~/aframe/data/train
        --trainer.logger=WandbLogger
        --trainer.logger.project=aframe
        --trainer.logger.save_dir=~/aframe/results/my-first-tune \
        --tune.name my-first-tune \
        --tune.storage_dir ~/aframe/results/my-first-tune \
        --tune.temp_dir ~/aframe/results/my-first-tune/ray \
        --tune.num_samples 8 \
        --tune.cpus_per_gpu 6 \
        --tune.gpus_per_worker 1 \
        --tune.num_workers 4

This will launch 8 hyperparameter search jobs that will execute on 4 workers using the Asynchronous Successive Halving Algorithm (ASHA). All the runs will be given the same group ID in W&B, and will be assigned random names in that group.

NOTE: for some reason, right now this will launch one job at a time that takes all available GPUs. This needs sorting out

If you already have a ray cluster running somewhere, you can distribute your jobs over that cluster by simply adding the --tune.endpoint <ip address of ray cluster>:10001 command line argument.

Similarly, to see a list of arguments one can locally run

poetry run python -m train.tune --help

or inside the container

apptainer run $AFRAME_CONTAINER_ROOT/train.sif python -m train.tune --help