Tuning ====== Hyperparameter tuning is powered by [Ray Tune](https://docs.ray.io/en/latest/tune/index.html). We utilize a wrapper library, [lightray](https://github.com/ethanmarx/lightray), that simplifies the use of Ray Tune with the PyTorch Lightning's `LightningCLI` which is used by `Aframe`. ## Overview In short, the tuning pipeline will configure a `Ray` "cluster" consisting of worker processes (for executing the training trials) and a head process (for scheduling and orchestrating the trials). The beauty of `Ray` is that Once the head node and at least one work node are in the "Running" state, the ip address of the head node will be queried, and `lightray` will connect to the `Ray` cluster and launch the job. ```{eval-rst} .. note: Tuning jobs can be run with local resources, or on the Nautilus hypercluster. Please see the `ml4gw quickstart `_ to set up the necessary software and credentials to run on Nautilus. ``` ## Initialize a Tune Experiment Setting up a tuning pipeline can be done with the `aframe-init` command line tool. It is very similar to setting up the {doc}`Sandbox ` pipeline. ```console poetry run aframe-init offline --mode tune --directory ~/aframe/my-first-tune/ ``` Similar to the `Sandbox` pipeline, `.cfg`, `.yaml` and `run.sh` files that configure the tuning experiment will be instantiated in the experiment directory. ## Configuring a Tuning Experiment The `lightray` library provides an interface to configure most of the parameter exposed by the [`ray.tune.Tuner`](https://docs.ray.io/en/latest/tune/api/doc/ray.tune.Tuner.html) class. This includes the scheduler, search algorithm, and other ray tune specific configuration. These can be configured in the `tune.yaml` file. ### Search Parameter Space A key component of the tuning experiment is the paramter space that is searched over. The parameter space can be set via the `param_space` attribute in the `tune.yaml` file. ```yaml # tune.yaml param_space: model.learning_rate: tune.loguniform(1e-4, 1e-1), data.kernel_length: tune.choice([1, 2]) ``` the parameter names should be python "dot paths" corresponding to attributes in the `train.yaml` file. Any parameters set in the search space will be sampled from the specified distribution when each trial is launched, and override the value set in the `train.yaml`. ### Local Tuning If `AFRAME_TRAIN_RUN_DIR` is set to a local path, then tuning will be performed using local resources. A local ray cluster will be initialized, and trials will be distributed across available resources. ### Remote Tuning If `AFRAME_TRAIN_RUN_DIR` is set to an `s3://` path, then tuning will be performed on Nautilus, which uses [Kubernetes](https://kubernetes.io/) for scheduling access to resources. In short, a helm chart is used to launch worker pods and a head pod. Once at the head pod and at least one worker pod are `RUNNING`, the search is started. ```{eval-rst} .. note: It is recommended you are familiar with Kubernetes and Nautilus. If you are not, the Nautilus introduction `tutorial` `_ is a good place to start. ``` The number of worker pods, gpus per pod, and other resource configuration can be specified under the `[luigi_ray_worker]` and `[luigi_ray_head]` headers in the `tune.cfg` file. See the [ray worker](https://github.com/ML4GW/aframe/blob/main/aframe/config.py#L16) and [ray head](https://github.com/ML4GW/aframe/blob/main/aframe/config.py#L48) `luigi.Config` objects for all possible configuration. ```cfg [luigi_ray_worker] # request 4 worker pods replicas = 4 # number of gpus per replica gpus_per_replica = 2 [luigi_ray_head] cpus = 32 memory = 32G ``` The above configuration will create 4 worker pods, each with 2 gpus for a total of 8 gpus. ```{eval-rst} ..note: If for some reason the tuning job fails, you can simply re-launch the pipeline and `Ray` will automatically restart the trials from the latest experiment state ``` ### Syncing Remote Code In some cases, it is necessary to launch a tuning job with code changes that haven’t been integrated into the `Aframe` `main` branch, and thus have not been pushed to the remote container. To allow this, the `lightray/ray-cluster` `helm` chart supports an optional [git-sync](https://github.com/kubernetes/git-sync) `initContainer` that will clone and mount remote code from github inside the kubernetes pods. The remote repository and reference can be configured under the `[luigi_TuneRemote]` header in the `tune.cfg` ```cfg [luigi_TuneRemote] ... # path to remote git repository git_url = git@github.com:albert.einstein/aframe.git # reference (e.g. branch or commit) to checkout git_ref = my-feature ``` ```{eval-rst} .. important:: The git-sync :code:`initContainer` uses your ssh key to clone software from github. To do so, a Kubernetes secret is made to mount your ssh key into the container. By default, :code:`Aframe` will automatically pull your ssh key from :code:`~/.ssh/id_rsa`. You can override this default under the :code:`[luigi_ssh]` header .. code-block:: ini [luigi_ssh] ssh_file = ~/.ssh/id_ed25519 ``` ## Restoring an Experiment