Remote Training
Note
It is recommended that you be familiar with running a local pipeline before proceeding.
Important
For remote training, you must have completed the ml4gw quickstart instructions, or installed the equivalent software.
Specifically, you must have configured s3 and Kubernetes for access to the Nautilus hypercluster.
It is also recommended that you be familiar with Nautilus and Kubernetes.
If you are not, the Nautilus introduction tutorial is a good place to start.
Remote experiments can be initialized using the aframe-init command line tool.
To initialize an experiment directory for a remote run, pass the --s3-bucket argument to aframe-init:
poetry run aframe-init offline --mode sandbox --directory ~/aframe/my-first-remote-run --s3-bucket s3://my-bucket/my-first-remote-run
This sets AFRAME_TRAIN_RUN_DIR and AFRAME_TRAIN_DATA_DIR in run.sh to point to the specified remote s3 bucket, while the remaining directories stay on the local filesystem:
#!/bin/bash
# Export environment variables
export AFRAME_TRAIN_DATA_DIR=s3://my-bucket/my-first-remote-run/data/train
export AFRAME_TEST_DATA_DIR=/home/albert.einstein/aframe/my-first-remote-run/data/test
export AFRAME_TRAIN_RUN_DIR=s3://my-bucket/my-first-remote-run/training
export AFRAME_CONDOR_DIR=/home/albert.einstein/aframe/my-first-remote-run/condor
export AFRAME_RESULTS_DIR=/home/albert.einstein/aframe/my-first-remote-run/results
export AFRAME_TMPDIR=/home/albert.einstein/aframe/my-first-remote-run/tmp/
# launch pipeline; modify the gpus, workers etc. to suit your needs
# note that if you've made local code changes not in the containers
# you'll need to add the --dev flag!
LAW_CONFIG_FILE=/home/albert.einstein/aframe/my-first-remote-run/sandbox.cfg poetry run --directory /home/albert.einstein/projects/aframev2 law run aframe.pipelines.sandbox.Sandbox --workers 5 --gpus 0
The luigi/law Tasks responsible for training data generation will automatically transfer your data to s3 storage and launch a remote training job using Kubernetes.
Configuring Remote Resources
The quantity of remote resources can be configured in the .cfg config file under the [luigi_Train] header:
[luigi_Train]
...
request_gpus = 4 # number of gpus to request
cpus_per_gpu = 12 # cpus per gpu
memory_per_cpu = 1 # memory per cpu in GB
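To get a feel for how these three settings combine, here is a minimal Python sketch that parses the section and estimates the total request. The helper name `total_resources` is hypothetical, and the arithmetic (total CPUs = GPUs × CPUs per GPU, total memory = total CPUs × memory per CPU) is an assumption drawn from the parameter names, not a statement of how Aframe builds the Kubernetes request.

```python
import configparser

# Example config text mirroring the [luigi_Train] values shown above.
CFG_TEXT = """
[luigi_Train]
request_gpus = 4 # number of gpus to request
cpus_per_gpu = 12 # cpus per gpu
memory_per_cpu = 1 # memory in GB
"""


def total_resources(cfg_text: str) -> dict:
    """Hypothetical helper: estimate the total request implied by the config."""
    # inline_comment_prefixes makes configparser strip the trailing "# ..." comments
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(cfg_text)
    train = parser["luigi_Train"]

    gpus = train.getint("request_gpus")
    cpus = gpus * train.getint("cpus_per_gpu")
    # Assumption: total memory scales as (total cpus) x (memory per cpu)
    memory_gb = cpus * train.getfloat("memory_per_cpu")
    return {"gpus": gpus, "cpus": cpus, "memory_gb": memory_gb}


if __name__ == "__main__":
    print(total_resources(CFG_TEXT))
```

Under that assumption, the values above would correspond to 4 GPUs, 48 CPUs and 48 GB of memory, which is worth comparing against what your namespace is allowed to request.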
It is also possible to sync Aframe code from a remote git repository into the container using an optional git-sync initContainer.
This is often useful when you are testing an idea that hasn’t made
it onto the Aframe main branch (and thus hasn’t been pushed to the remote container image). To do so, specify the following
in the .cfg:
[luigi_Train]
...
# use kubernetes initContainer to sync code
use_init_containers = True
# path to remote git repository
git_url = git@github.com:albert.einstein/aframev2.git
# reference (e.g. branch or commit) to checkout
git_ref = my-feature
Important
The git-sync initContainer uses your ssh key to clone software from GitHub. To do so, a Kubernetes secret
is created that mounts your ssh key into the container. By default, Aframe will automatically pull your ssh key from
~/.ssh/id_rsa. You can override this default under the [luigi_ssh] header:
[luigi_ssh]
ssh_file = ~/.ssh/id_ed25519
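Before launching, it can help to confirm which key file the config selects and that it exists on disk. The sketch below is illustrative only: `resolve_ssh_file` is a hypothetical helper, not part of Aframe, and it simply mirrors the default-plus-override behavior described above.

```python
import configparser
from pathlib import Path

DEFAULT_SSH_FILE = "~/.ssh/id_rsa"  # default stated in the text above


def resolve_ssh_file(cfg_text: str) -> Path:
    """Hypothetical helper: return the ssh key path the config would select."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Fall back to the default when [luigi_ssh] or ssh_file is absent
    raw = parser.get("luigi_ssh", "ssh_file", fallback=DEFAULT_SSH_FILE)
    # Expand "~" to the current user's home directory
    return Path(raw).expanduser()


if __name__ == "__main__":
    key = resolve_ssh_file("[luigi_ssh]\nssh_file = ~/.ssh/id_ed25519\n")
    print(key, "exists" if key.is_file() else "is missing")
```

A missing key file is easier to catch here than after the initContainer has already failed to clone.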
Local Training with S3 Data
Sometimes your data lives on an s3 filesystem, but you wish to train using local resources. To do so,
set AFRAME_TRAIN_RUN_DIR to a local path and AFRAME_TRAIN_DATA_DIR to an s3:// location. The training project will detect that the specified data
lives on s3 and download it.
#!/bin/bash
# remote s3 data
export AFRAME_TRAIN_DATA_DIR=s3://my-bucket/remote-data-local-training/data/train
# local training
export AFRAME_TRAIN_RUN_DIR=/home/albert.einstein/remote-data-local-training/training
...
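The local-versus-remote distinction above comes down to whether a path carries the s3:// scheme. The following Python sketch shows one way such a check could look. The helpers `is_s3_path` and `split_s3_path` are hypothetical names used for illustration; the source does not describe how the training project actually performs its detection.

```python
from typing import Tuple
from urllib.parse import urlparse


def is_s3_path(path: str) -> bool:
    """Hypothetical helper: True when the path uses the s3:// scheme."""
    return urlparse(path).scheme == "s3"


def split_s3_path(path: str) -> Tuple[str, str]:
    """Hypothetical helper: split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(path)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3 path: {path}")
    # netloc holds the bucket name; the remainder is the key prefix
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    data_dir = "s3://my-bucket/remote-data-local-training/data/train"
    run_dir = "/home/albert.einstein/remote-data-local-training/training"
    for name, value in [("data", data_dir), ("run", run_dir)]:
        print(name, "remote" if is_s3_path(value) else "local")
```

With the two directories from the script above, the data directory is classified as remote and the run directory as local, which is the combination this section describes.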