Tuning

Hyperparameter tuning is powered by Ray Tune. We use a wrapper library, lightray, that simplifies using Ray Tune with PyTorch Lightning’s LightningCLI, which Aframe uses for training.

Overview

In short, the tuning pipeline configures a Ray “cluster” consisting of worker processes (which execute the training trials) and a head process (which schedules and orchestrates the trials).

Once the head node and at least one worker node are in the “Running” state, the IP address of the head node is queried, and lightray connects to the Ray cluster and launches the job.
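The readiness condition and the connection step can be sketched in Python. This is only an illustration, not lightray's actual code: the function names `cluster_ready`, `ray_address`, and `connect` are hypothetical, and connecting through Ray Client on its default port 10001 is an assumption about how the cluster is reached.

```python
def cluster_ready(head_phase, worker_phases):
    """Return True once the head and at least one worker are Running."""
    return head_phase == "Running" and any(p == "Running" for p in worker_phases)


def ray_address(head_ip, port=10001):
    """Build a Ray Client address (10001 is the Ray Client default port)."""
    return f"ray://{head_ip}:{port}"


def connect(head_ip):
    """Connect to the cluster; assumes the `ray` package is installed."""
    import ray  # imported lazily so the helpers above work without Ray

    return ray.init(address=ray_address(head_ip))
```

A caller would poll `cluster_ready` until it returns True and only then call `connect` with the head node's IP address.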

Initialize a Tune Experiment

Setting up a tuning pipeline can be done with the aframe-init command line tool. It is very similar to setting up the Sandbox pipeline.

poetry run aframe-init offline --mode tune --directory ~/aframe/my-first-tune/

As with the Sandbox pipeline, the .cfg, .yaml, and run.sh files that configure the tuning experiment are created in the experiment directory.

Configuring a Tuning Experiment

The lightray library provides an interface to configure most of the parameters exposed by the ray.tune.Tuner class. This includes the scheduler, the search algorithm, and other Ray Tune-specific configuration. These can be configured in the tune.yaml file.

Search Parameter Space

A key component of the tuning experiment is the parameter space that is searched over. The parameter space can be set via the param_space attribute in the tune.yaml file.

# tune.yaml

param_space:
    model.learning_rate: tune.loguniform(1e-4, 1e-1)
    data.kernel_length: tune.choice([1, 2])

The parameter names should be Python “dot paths” corresponding to attributes in the train.yaml file. Any parameter set in the search space is sampled from the specified distribution when each trial is launched, and overrides the value set in train.yaml.
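To make the override behaviour concrete, here is a minimal standard-library sketch of sampling a search space and applying the result to a nested config by dot path. It is not how lightray or Ray Tune implement this. The helpers `loguniform` and `apply_overrides` and the contents of `train_config` are hypothetical stand-ins chosen for illustration.

```python
import copy
import math
import random


def loguniform(low, high, rng):
    """Sample so that log(x) is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def apply_overrides(config, overrides):
    """Return a copy of a nested config with dot-path keys overridden."""
    result = copy.deepcopy(config)
    for dot_path, value in overrides.items():
        *parents, leaf = dot_path.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return result


# Hypothetical stand-in for the contents of train.yaml
train_config = {
    "model": {"learning_rate": 1e-3},
    "data": {"kernel_length": 1.5, "batch_size": 512},
}

# One trial's sample from the search space shown in tune.yaml
rng = random.Random(0)
sampled = {
    "model.learning_rate": loguniform(1e-4, 1e-1, rng),
    "data.kernel_length": rng.choice([1, 2]),
}
trial_config = apply_overrides(train_config, sampled)
```

Values that are not part of the search space (here the hypothetical `batch_size`) keep whatever was set in the base config.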

Local Tuning

If AFRAME_TRAIN_RUN_DIR is set to a local path, then tuning will be performed using local resources. A local Ray cluster will be initialized, and trials will be distributed across the available resources.

Remote Tuning

If AFRAME_TRAIN_RUN_DIR is set to an s3:// path, then tuning will be performed on Nautilus, which uses Kubernetes for scheduling access to resources. In short, a Helm chart is used to launch worker pods and a head pod. Once the head pod and at least one worker pod are RUNNING, the search is started.
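The choice between local and remote tuning therefore depends only on the scheme of AFRAME_TRAIN_RUN_DIR. A minimal sketch of that check is shown here. The function name `tuning_mode` and the placeholder default path are hypothetical, and Aframe's own check may be implemented differently.

```python
import os
from urllib.parse import urlparse


def tuning_mode(run_dir):
    """Return "remote" for s3:// paths and "local" for everything else."""
    return "remote" if urlparse(run_dir).scheme == "s3" else "local"


# The default below is only a placeholder for illustration
run_dir = os.environ.get("AFRAME_TRAIN_RUN_DIR", "/tmp/aframe/train")
mode = tuning_mode(run_dir)
```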

The number of worker pods, the number of GPUs per pod, and other resource settings can be specified under the [luigi_ray_worker] and [luigi_ray_head] headers in the tune.cfg file. See the ray worker and ray head luigi.Config objects for all available configuration options.

[luigi_ray_worker]
# request 4 worker pods
replicas = 4 
# number of gpus per replica
gpus_per_replica = 2


[luigi_ray_head]
cpus = 32
memory = 32G

The above configuration will create 4 worker pods, each with 2 GPUs, for a total of 8 GPUs.
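The GPU total is simply replicas multiplied by gpus_per_replica. As a quick sanity check before launching, the same arithmetic can be done with Python's standard configparser. This is a sketch that reads an inline copy of the worker section, not the way Aframe itself parses tune.cfg.

```python
import configparser

# Same worker settings as the tune.cfg example above
TUNE_CFG = """
[luigi_ray_worker]
# request 4 worker pods
replicas = 4
# number of gpus per replica
gpus_per_replica = 2
"""

parser = configparser.ConfigParser()
parser.read_string(TUNE_CFG)

worker = parser["luigi_ray_worker"]
total_gpus = worker.getint("replicas") * worker.getint("gpus_per_replica")
print(total_gpus)  # 8
```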

Note

If the tuning job fails for some reason, you can simply re-launch the pipeline, and Ray will automatically restart the trials from the latest experiment state.

Syncing Remote Code

In some cases, it is necessary to launch a tuning job with code changes that haven’t been integrated into the Aframe main branch, and thus have not been pushed to the remote container. To allow this, the lightray/ray-cluster Helm chart supports an optional git-sync initContainer that will clone and mount remote code from GitHub inside the Kubernetes pods. The remote repository and reference can be configured under the [luigi_TuneRemote] header in the tune.cfg file.

[luigi_TuneRemote]
...
# path to remote git repository
git_url = git@github.com:albert.einstein/aframe.git
# reference (e.g. branch or commit) to checkout
git_ref = my-feature

Important

The git-sync initContainer uses your SSH key to clone software from GitHub. To do so, a Kubernetes secret is created to mount your SSH key into the container. By default, Aframe will automatically pull your SSH key from ~/.ssh/id_rsa. You can override this default under the [luigi_ssh] header:

[luigi_ssh]
ssh_file = ~/.ssh/id_ed25519

Restoring an Experiment