Data
====
Scripts for producing training and testing data for `Aframe`

## Environment
The `data` project environment utilizes `Mamba` and `poetry`. `Mamba` is needed for installing
the LIGO frame reading libraries [`python-ldas-tools-framecpp`](https://anaconda.org/conda-forge/python-ldas-tools-framecpp/) and [https://anaconda.org/conda-forge/python-nds2-client](`python-nds2-client`), which are unavailable on PyPi.

In the root of the `data` project, run 
```bash
apptainer build $AFRAME_CONTAINER_ROOT/data.sif apptainer.def
```
to build the `data` container. 

The container will first build an environment using the [`conda-lock.yml`](./conda-lock.yml), and then install local dependencies defined in the [`pyproject.toml`](./pyproject.toml).

If the dependencies in the [`environment.yaml`](./environment.yaml) require modifications, the `conda-lock.yml` will need to be updated

```bash
conda-lock -f environment.yaml -p linux-64
```

and the container image will need to be rebuilt.

## Scripts
The data project consists of four main sub-modules:

1. `data/segments` - Querying science mode segments
2. `data/fetch` - Fetching strain data
3. `data/timeslide_waveforms` - Generating waveforms for injection campaigns
4. `data/waveforms` - Generating waveforms for training Aframe

Additionally, the main executable of each sub-module is exposed via a CLI at `data/cli.py`

## Example: generating training data
As an example, let's build a training dataset using the CLI in the `data` container we built above

First, let's make a data storage directory, and query science mode segments from [gwosc](gwosc.org)
```bash
mkdir -p ~/aframe/data/train/background
apptainer run $AFRAME_CONTAINER_ROOT/data.sif \
    python -m data query --flags='["H1_DATA", "L1_DATA"]' --start 1240579783 --end 1241443783 --output_file ~/aframe/data/segments.txt
```

Inspecting the output, (`vi ~/aframe/data/segments.txt`) it looks like there are science mode data segments between `(1240579783, 1240587612)` and `(1240594562, 1240606748)`. 

Next, let's fetch strain data during those segments. One will be used for training, the other for validating

```bash
apptainer run $AFRAME_CONTAINER_ROOT/data.sif \
    python -m data fetch \
    --start 1240579783 \
    --end 1240587612 \
    --channels='["H1", "L1"]' \
    --sample_rate 2048 \
    --output_directory ~/aframe/data/train/background/

apptainer run $AFRAME_CONTAINER_ROOT/data.sif \
    python -m data fetch \
    --start 1240594562 \
    --end 1240606748 \
    --channels='["H1", "L1"]' \
    --sample_rate 2048 \
    --output_directory ~/aframe/data/train/background/
```

Finally, lets generate some waveforms for training

```bash
apptainer run $AFRAME_CONTAINER_ROOT/data.sif \
    python -m data training_waveforms \
    --num_signals 10000 \
    --waveform_duration 8 \
    --sample_rate 2048 \
    --prior priors.priors.end_o3_ratesandpops \
    --minimum_frequency 20 \
    --reference_frequency 50 \
    --waveform_approximant IMRPhenomXPHM \
    --coalescence_time 6 \
    --output_file ~/aframe/data/train/train_waveforms.hdf5
```

and validation. Note that this uses one of the background files downloaded above.

```bash
apptainer run $AFRAME_CONTAINER_ROOT/data.sif \
    python -m data validation_waveforms \
    --num_signals 2000 \
    --prior priors.priors.end_o3_ratesandpops \
    --ifos='["H1", "L1"]' \
    --minimum_frequency 20 \
    --reference_frequency 50 \
    --sample_rate 2048 \
    --waveform_duration 8 \
    --waveform_approximant IMRPhenomXPHM \
    --coalescence_time 6 \
    --highpass 32 \
    --snr_threshold 4 \
    --psd ~/aframe/data/train/background/background-1240579783-7829.hdf5
    --output_file ~/aframe/data/train/val_waveforms.hdf5
```

Note that the train project assumes these waveform files are named as above! To continue this example, see the [training `Aframe` example](../train/README.md#example-training-aframe)