Tuning
======
Hyperparameter tuning is powered by [Ray Tune](https://docs.ray.io/en/latest/tune/index.html). We use a wrapper library, [lightray](https://github.com/ethanmarx/lightray), that simplifies using Ray Tune with PyTorch Lightning's `LightningCLI`, which `Aframe` uses for training.
## Overview
In short, the tuning pipeline configures a `Ray` "cluster" consisting of worker processes (for executing the training trials)
and a head process (for scheduling and orchestrating the trials).
Once the head node and at least one worker node are in the "Running" state, the IP address of the head node is queried, and `lightray` connects to the `Ray`
cluster and launches the job.
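
The readiness condition described above can be sketched in a few lines of Python. This is an illustration only, not how `lightray` or `Aframe` implement it: the `Node` record and the helper names are hypothetical, and the use of the Ray Client address format (`ray://<ip>:10001`, where 10001 is the default Ray Client port) is an assumption about how the connection is made.

```python
from dataclasses import dataclass

# Default port of the Ray Client server on a head node.
RAY_CLIENT_PORT = 10001


@dataclass
class Node:
    """Hypothetical record describing one node (or pod) of the cluster."""

    role: str  # "head" or "worker"
    state: str  # e.g. "Pending" or "Running"
    ip: str


def cluster_ready(nodes):
    """Return True once the head and at least one worker are Running."""
    running = [n for n in nodes if n.state == "Running"]
    has_head = any(n.role == "head" for n in running)
    has_worker = any(n.role == "worker" for n in running)
    return has_head and has_worker


def head_address(nodes, port=RAY_CLIENT_PORT):
    """Build a Ray Client address from the head node's IP."""
    head = next(n for n in nodes if n.role == "head")
    return f"ray://{head.ip}:{port}"


def connect(nodes):
    """Connect to the cluster once it is ready (requires `ray` to be installed)."""
    if not cluster_ready(nodes):
        raise RuntimeError("head and at least one worker must be Running")
    import ray  # imported lazily so the helpers above work without Ray

    return ray.init(address=head_address(nodes))
```

Only `connect` needs Ray itself; the other helpers show the condition that gates the launch.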
```{eval-rst}
.. note::
    Tuning jobs can be run with local resources or on the Nautilus hypercluster. Please see the ml4gw quickstart to
    set up the necessary software and credentials to run on Nautilus.
```
## Initialize a Tune Experiment
Setting up a tuning pipeline can be done with the `aframe-init` command line tool. It is very similar to setting up the {doc}`Sandbox ` pipeline.
```console
poetry run aframe-init offline --mode tune --directory ~/aframe/my-first-tune/
```
Similar to the `Sandbox` pipeline, `.cfg`, `.yaml` and `run.sh` files that configure the tuning experiment will be instantiated in the experiment directory.
## Configuring a Tuning Experiment
The `lightray` library provides an interface to configure most of the parameters exposed by the [`ray.tune.Tuner`](https://docs.ray.io/en/latest/tune/api/doc/ray.tune.Tuner.html) class. This includes the scheduler, the search algorithm, and other Ray Tune-specific configuration. These can be configured in the `tune.yaml` file.
### Search Parameter Space
A key component of the tuning experiment is the parameter space that is searched over. The parameter space can be set
via the `param_space` attribute in the `tune.yaml` file.
```yaml
# tune.yaml
param_space:
  model.learning_rate: tune.loguniform(1e-4, 1e-1)
  data.kernel_length: tune.choice([1, 2])
```
The parameter names should be Python "dot paths" corresponding to attributes in the `train.yaml` file. Any
parameter set in the search space will be sampled from the specified distribution
when each trial is launched, and will override the value set in `train.yaml`.
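
To make the sampling and override semantics concrete, here is a minimal standard-library sketch. It does not use Ray Tune or `lightray`: the `loguniform` helper mimics what a log-uniform distribution means, and the `base` dictionary is a hypothetical stand-in for the contents of `train.yaml`.

```python
import copy
import math
import random


def loguniform(low, high, rng):
    """Sample so that log(value) is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def set_by_dot_path(config, dot_path, value):
    """Set config["a"]["b"] = value for the dot path "a.b"."""
    *parents, leaf = dot_path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def sample_trial_config(base_config, rng):
    """Return a copy of base_config with the searched parameters overridden."""
    sampled = {
        "model.learning_rate": loguniform(1e-4, 1e-1, rng),
        "data.kernel_length": rng.choice([1, 2]),
    }
    trial_config = copy.deepcopy(base_config)
    for dot_path, value in sampled.items():
        set_by_dot_path(trial_config, dot_path, value)
    return trial_config


# Hypothetical stand-in for the contents of train.yaml
base = {
    "model": {"learning_rate": 1e-3},
    "data": {"kernel_length": 1.5, "batch_size": 512},
}
trial = sample_trial_config(base, random.Random(0))
print(trial)
```

Parameters that are not part of the search space (here `data.batch_size`) keep their original value in every trial, and the base configuration itself is left unmodified.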
### Local Tuning
If `AFRAME_TRAIN_RUN_DIR` is set to a local path, then tuning will be performed using local resources. A local ray cluster
will be initialized, and trials will be distributed across available resources.
### Remote Tuning
If `AFRAME_TRAIN_RUN_DIR` is set to an `s3://` path, then tuning will be performed on Nautilus, which uses [Kubernetes](https://kubernetes.io/) for scheduling access to resources. In short, a Helm chart is used to launch worker pods and a head pod. Once the head pod and at least one worker pod are `RUNNING`, the search is started.
```{eval-rst}
.. note::
    It is recommended that you are familiar with Kubernetes and Nautilus.
    If you are not, the Nautilus introduction tutorial
    is a good place to start.
```
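
The choice between local and remote tuning depends only on the form of `AFRAME_TRAIN_RUN_DIR`. The rule can be sketched as follows; the function name `tuning_backend` is hypothetical and is not part of `Aframe`.

```python
import os


def tuning_backend(run_dir):
    """Return "remote" for s3:// run directories and "local" otherwise."""
    return "remote" if run_dir.startswith("s3://") else "local"


# Report the backend implied by the current environment, if the variable is set.
run_dir = os.environ.get("AFRAME_TRAIN_RUN_DIR", "")
if run_dir:
    print(f"{run_dir} -> {tuning_backend(run_dir)} tuning")
```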
The number of worker pods, GPUs per pod, and other resource settings can be specified under the
`[luigi_ray_worker]` and `[luigi_ray_head]` headers in the `tune.cfg` file. See the [ray worker](https://github.com/ML4GW/aframe/blob/main/aframe/config.py#L16)
and [ray head](https://github.com/ML4GW/aframe/blob/main/aframe/config.py#L48) `luigi.Config` objects for all available configuration options.
```cfg
[luigi_ray_worker]
# request 4 worker pods
replicas = 4
# number of gpus per replica
gpus_per_replica = 2
[luigi_ray_head]
cpus = 32
memory = 32G
```
The above configuration will create 4 worker pods, each with 2 GPUs, for a total of 8 GPUs.
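
As a quick sanity check before requesting resources, the total GPU count can be computed from the configuration. The sketch below parses the example with Python's standard `configparser`; this is for illustration only and is not necessarily how `luigi` loads `tune.cfg`.

```python
import configparser

# The example tune.cfg fragment from above, inlined for illustration
TUNE_CFG = """
[luigi_ray_worker]
# request 4 worker pods
replicas = 4
# number of gpus per replica
gpus_per_replica = 2

[luigi_ray_head]
cpus = 32
memory = 32G
"""

parser = configparser.ConfigParser()
parser.read_string(TUNE_CFG)

worker = parser["luigi_ray_worker"]
# total GPUs = number of worker pods x GPUs per pod
total_gpus = worker.getint("replicas") * worker.getint("gpus_per_replica")
print(total_gpus)  # 8
```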
```{eval-rst}
.. note::
    If for some reason the tuning job fails, you can simply re-launch the pipeline and `Ray` will automatically restart the trials
    from the latest experiment state.
```
### Syncing Remote Code
In some cases, it is necessary to launch a tuning job with code changes that haven’t been integrated into the `Aframe` `main` branch, and thus have not been pushed to the remote container. To allow this, the `lightray/ray-cluster` `helm` chart supports an optional [git-sync](https://github.com/kubernetes/git-sync) `initContainer` that will clone and mount remote code from GitHub inside the Kubernetes pods. The remote repository and reference can be configured under the `[luigi_TuneRemote]` header in the `tune.cfg` file.
```cfg
[luigi_TuneRemote]
...
# path to remote git repository
git_url = git@github.com:albert.einstein/aframe.git
# reference (e.g. branch or commit) to checkout
git_ref = my-feature
```
```{eval-rst}
.. important::
    The git-sync :code:`initContainer` uses your SSH key to clone software from GitHub. To do so, a Kubernetes secret
    is made to mount your SSH key into the container. By default, :code:`Aframe` will automatically pull your SSH key from
    :code:`~/.ssh/id_rsa`. You can override this default under the :code:`[luigi_ssh]` header:

    .. code-block:: ini

        [luigi_ssh]
        ssh_file = ~/.ssh/id_ed25519
```
## Restoring an Experiment