Remote Training

Note

We recommend that you are familiar with running a local pipeline before proceeding.

Important

For remote training, you must have completed the ml4gw quickstart instructions or installed the equivalent software. Specifically, S3 and Kubernetes must be configured for access to the Nautilus hypercluster. We also recommend that you are familiar with Nautilus and Kubernetes. If you are not, the Nautilus introduction tutorial is a good place to start.

Remote experiments can be initialized using the aframe-init command line tool. To initialize an experiment directory for a remote run, pass the --s3-bucket argument to aframe-init:

poetry run aframe-init offline --mode sandbox --directory ~/aframe/my-first-remote-run --s3-bucket s3://my-bucket/my-first-remote-run

This sets AFRAME_TRAIN_RUN_DIR and AFRAME_TRAIN_DATA_DIR in the generated run.sh to point to the specified S3 bucket, while the remaining directories stay on the local filesystem:

#!/bin/bash
# Export environment variables
export AFRAME_TRAIN_DATA_DIR=s3://my-bucket/my-first-remote-run/data/train
export AFRAME_TEST_DATA_DIR=/home/albert.einstein/aframe/my-first-remote-run/data/test
export AFRAME_TRAIN_RUN_DIR=s3://my-bucket/my-first-remote-run/training
export AFRAME_CONDOR_DIR=/home/albert.einstein/aframe/my-first-remote-run/condor
export AFRAME_RESULTS_DIR=/home/albert.einstein/aframe/my-first-remote-run/results
export AFRAME_TMPDIR=/home/albert.einstein/aframe/my-first-remote-run/tmp/

# launch pipeline; modify the gpus, workers etc. to suit your needs
# note that if you've made local code changes not in the containers
# you'll need to add the --dev flag!
LAW_CONFIG_FILE=/home/albert.einstein/aframe/my-first-remote-run/sandbox.cfg poetry run --directory /home/albert.einstein/projects/aframev2 law run aframe.pipelines.sandbox.Sandbox --workers 5 --gpus 0

The luigi/law tasks responsible for training data generation will automatically transfer your data to S3 storage and launch a remote training job using Kubernetes.
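
The split between remote and local directories in the run.sh above can be summarized with a short Python sketch. This is not the aframe-init implementation; the function name remote_run_environment is hypothetical, and the path layout is simply copied from the example run.sh.

```python
from pathlib import PurePosixPath


def remote_run_environment(directory: str, s3_bucket: str) -> dict[str, str]:
    """Rebuild the variables from the example run.sh (illustrative only)."""
    local = PurePosixPath(directory)
    remote = s3_bucket.rstrip("/")
    return {
        # training data and training outputs live in the S3 bucket
        "AFRAME_TRAIN_DATA_DIR": f"{remote}/data/train",
        "AFRAME_TRAIN_RUN_DIR": f"{remote}/training",
        # everything else stays under the local experiment directory
        "AFRAME_TEST_DATA_DIR": str(local / "data" / "test"),
        "AFRAME_CONDOR_DIR": str(local / "condor"),
        "AFRAME_RESULTS_DIR": str(local / "results"),
        "AFRAME_TMPDIR": str(local / "tmp") + "/",
    }


env = remote_run_environment(
    "/home/albert.einstein/aframe/my-first-remote-run",
    "s3://my-bucket/my-first-remote-run",
)
for name, value in env.items():
    where = "remote" if value.startswith("s3://") else "local"
    print(f"{name} ({where}): {value}")
```

Printing the mapping this way is a quick check that only the two training variables carry an s3:// prefix.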

Configuring Remote Resources

The quantity of remote resources requested for training can be configured in the .cfg config file under the [luigi_Train] header:

[luigi_Train]
...
request_gpus = 4 # number of gpus to request
cpus_per_gpu = 12 # cpus per gpu
memory_per_cpu = 1 # memory in GB
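
As a rough guide to what these values add up to, the following Python sketch computes the totals for the settings above. It assumes the per-unit values simply multiply (GPUs × CPUs per GPU × memory per CPU), which is suggested by the parameter names but not confirmed here; the TrainResources class is hypothetical and not part of Aframe.

```python
from dataclasses import dataclass


@dataclass
class TrainResources:
    request_gpus: int      # number of GPUs to request
    cpus_per_gpu: int      # CPUs per GPU
    memory_per_cpu: float  # memory per CPU, in GB

    @property
    def total_cpus(self) -> int:
        # assumption: CPU count scales with the number of GPUs
        return self.request_gpus * self.cpus_per_gpu

    @property
    def total_memory_gb(self) -> float:
        # assumption: memory scales with the total number of CPUs
        return self.total_cpus * self.memory_per_cpu


resources = TrainResources(request_gpus=4, cpus_per_gpu=12, memory_per_cpu=1)
print(f"CPUs: {resources.total_cpus}, memory: {resources.total_memory_gb} GB")
```

Under that assumption, the example configuration asks for 48 CPUs and 48 GB of memory in total.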

It is also possible to sync Aframe code from a remote git repository into the container using an optional git-sync initContainer. This is often useful when you are testing an idea that hasn’t made it onto the Aframe main branch (and thus hasn’t been pushed to the remote container image). To do so, specify the following in the .cfg:

[luigi_Train]
...
# use kubernetes initContainer to sync code
use_init_containers = True
# path to remote git repository
git_url = git@github.com:albert.einstein/aframev2.git
# reference (e.g. branch or commit) to checkout
git_ref = my-feature

Important

The git-sync initContainer uses your SSH key to clone software from GitHub. To do so, a Kubernetes secret is created that mounts your SSH key into the container. By default, Aframe will automatically pull your SSH key from ~/.ssh/id_rsa. You can override this default under the [luigi_ssh] header:

[luigi_ssh]
ssh_file = ~/.ssh/id_ed25519
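
Before launching a job, it can help to confirm which key file would be picked up and that it exists. The sketch below uses Python's standard configparser rather than the luigi/law configuration machinery, so it only approximates how the setting is read; resolve_ssh_file is a hypothetical helper, and the default path is the one stated above.

```python
import configparser
from pathlib import Path

DEFAULT_SSH_FILE = "~/.ssh/id_rsa"  # default stated in the text above


def resolve_ssh_file(cfg_text: str) -> Path:
    """Return the ssh_file from [luigi_ssh], or the default if it is unset."""
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(cfg_text)
    # fallback covers both a missing section and a missing option
    raw = parser.get("luigi_ssh", "ssh_file", fallback=DEFAULT_SSH_FILE)
    return Path(raw).expanduser()


key = resolve_ssh_file("[luigi_ssh]\nssh_file = ~/.ssh/id_ed25519\n")
print(key, "exists" if key.is_file() else "is missing")
```

To check a real experiment, pass the contents of its .cfg file instead of the inline string.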

Local Training with S3 Data

Sometimes your data lives on an S3 filesystem, but you wish to train using local resources. To do so, set AFRAME_TRAIN_RUN_DIR to a local path and AFRAME_TRAIN_DATA_DIR to an s3:// location. The training project will detect that the specified data lives on S3 and download it.

#!/bin/bash

# remote s3 data
export AFRAME_TRAIN_DATA_DIR=s3://my-bucket/remote-data-local-training/data/train
# local training
export AFRAME_TRAIN_RUN_DIR=/home/albert.einstein/remote-data-local-training/training
...
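
The distinction between remote data and a local run directory comes down to the s3:// prefix. The following Python sketch shows one way such a check could look; it is not Aframe's actual detection logic, and split_s3_uri and needs_download are hypothetical helpers.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def split_s3_uri(path: str) -> Optional[Tuple[str, str]]:
    """Return (bucket, key) for an s3:// URI, or None for a local path."""
    parsed = urlparse(path)
    if parsed.scheme != "s3":
        return None
    return parsed.netloc, parsed.path.lstrip("/")


def needs_download(data_dir: str, run_dir: str) -> bool:
    """True when the data is on S3 but training runs locally."""
    return split_s3_uri(data_dir) is not None and split_s3_uri(run_dir) is None


data_dir = "s3://my-bucket/remote-data-local-training/data/train"
run_dir = "/home/albert.einstein/remote-data-local-training/training"
print(split_s3_uri(data_dir))
print(needs_download(data_dir, run_dir))
```

With the two variables from the example above, the data directory resolves to a bucket and key while the run directory does not, which is the case where a download to local storage is needed.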