This package automates the process of generating large amounts of data, providing a clean interface between your simulation and the SLURM workload manager. It also manages the datasets you choose to generate, and allows easy access to cached simulations that load quickly. If you need more data than you have, SLURM_gen lets you know how many more samples need to be generated, and how much compute time it will take.
## Installation

```bash
pip install -e .  # don't forget the period
```
## Usage
SLURM_gen provides a simple command line interface to
- generate data samples,
- assign those samples to a particular dataset name, like 'train' or 'test', and
- track the number of samples generated for various datasets and parameters.
You can define your own datasets simply by writing a function that outputs feature-label pairs. Define that function in a file called `datasets.py`, and point SLURM_gen at the directory containing that file.
## Example
Here we'll show how to define a simple dataset, generate some samples, and access them.
### Define the generator
Start by using the `DefaultParamObject` class and the `@dataset` decorator to define a new dataset. These definitions should be placed in a Python file called `datasets.py`.
```python
# example/datasets.py
import math
import random

from slurm_gen import DefaultParamObject, dataset


class NoisySineParams(DefaultParamObject):
    """Attributes defining parameters to the noisy_sine experiment."""

    # leftmost allowed value for x
    left = -1

    # rightmost allowed value for x
    right = 1

    # standard deviation of noise to add to sin(x)
    std_dev = 0.1


# we can specify extra SLURM batch parameters here
options = "--qos=test"


# here we also tell SLURM_gen to request 1GB of memory and save every 50 samples
@dataset(NoisySineParams, "1GB", 50, options)
def noisy_sine(size, params):
    """Create samples from a noisy sine wave.

    Args:
        size (int): number of samples to generate.
        params (NoisySineParams): parameters to the experiment.
    Yields:
        (float): x-value.
        (float): y-value plus noise.
    """
    for _ in range(size):
        x = random.uniform(params.left, params.right)
        yield x, math.sin(x) + random.normalvariate(mu=0, sigma=params.std_dev)
```
The `NoisySineParams` class defines the possible configuration parameters that the generator can accept, as well as the default values for those parameters. When generating or accessing samples, we can specify non-default values for any of these parameters.
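For instance, a minimal sketch of overriding defaults when constructing the parameter object (keyword-argument construction also appears in the indexing examples later in this README):

```python
# a sketch: keyword arguments are assumed to override the class-level defaults
params = NoisySineParams(left=0, std_dev=0.5)
# 'right' is assumed to keep its default value of 1
```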
The `@dataset` decorator converts `noisy_sine` into a dataset which can be used by the `slurm_gen.generate` module to create cache files containing arbitrary numbers of samples. We can define as many functions as we like in `datasets.py`, and all those marked with `@dataset` will be usable in SLURM_gen.
### Generate samples
Now that we've defined the generator, we can generate some samples for that dataset like this:
```bash
cd example/  # the directory containing datasets.py
python -m slurm_gen.generate noisy_sine -n 1000 --njobs 3 --time "10"
python -m slurm_gen.generate noisy_sine -n 1000 --njobs 3 --params "{'left': 0, 'std_dev': 0.5}"
```
In the first example above, we submitted 3 SLURM jobs, splitting the 1000 samples evenly among them. Since we had no samples for this dataset yet, we had to provide `--time`. In the second example, we omitted the `--time` argument, so a time limit three standard deviations above the mean duration of previous runs was used, scaled to the number of samples per job. In the second example we also set some configuration parameters to non-default values.
### Managing samples
We can list the available samples from the command line:
```bash
cd example/
python -m slurm_gen.list
```
The output will look like this:
```
noisy_sine:
  Param set #0:
    left#-1|right#1| raw: 1000
    std_dev#0.1|
  Param set #1:
    left#0|right#1| raw: 1000
    std_dev#0.5|
```
We can see the samples for the "noisy_sine" dataset divided into sets by the parameters given.
If we want to move some of those samples into a group labeled "train", we can do so like this:
```bash
cd example/
python -m slurm_gen.move noisy_sine 700 train -p 0
```
The `-p` argument identifies which parameter set to use. You can also use a dictionary of values as the identifier, by passing a string that will be evaluated as a dictionary, as shown below.
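For example, a sketch of the dictionary form, assuming the same quoting style as the `--params` flag above:

```bash
python -m slurm_gen.move noisy_sine 700 train -p "{'left': -1, 'right': 1, 'std_dev': 0.1}"
```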
After the move, the output of `python -m slurm_gen.list` will be
```
noisy_sine:
  Param set #0:
    left#-1|right#1| raw: 300
    std_dev#0.1|     train: unprocessed(700)
  Param set #1:
    left#0|right#1| raw: 1000
    std_dev#0.5|
```
Once you've moved samples into a labeled group, you can't move them back. This is to avoid accidentally mixing samples between groups, possibly inflating the accuracy of machine learning models.
### Preprocessing samples
You may have noticed that `slurm_gen.list` noted 700 "unprocessed" samples. Once samples are in a group, you can apply preprocessors to them. Preprocessors must be defined in the same `datasets.py` file. To continue the example, add the following preprocessor for our `noisy_sine` dataset.
```python
# added to datasets.py
@noisy_sine.preprocessor
def square_both(X, y):
    """Square both the inputs and the outputs."""
    return [ex ** 2 for ex in X], [wai ** 2 for wai in y]
```
Note that the preprocessor is defined for one particular dataset. If the same preprocessor needs to be defined for multiple datasets, just stack the decorators one after the other, as sketched below.
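For example, a sketch registering the same function for a second, hypothetical dataset named `noisy_cosine`:

```python
# noisy_cosine is hypothetical; assume it was defined with @dataset like noisy_sine
@noisy_sine.preprocessor
@noisy_cosine.preprocessor
def square_both(X, y):
    """Square both the inputs and the outputs."""
    return [ex ** 2 for ex in X], [wai ** 2 for wai in y]
```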
Preprocess some samples from 'train' by running the following command:
```bash
python -m slurm_gen.preprocess noisy_sine square_both train 600 -p 0
```
After the data is preprocessed, the output of `python -m slurm_gen.list` will be
```
noisy_sine:
  Param set #0:
    left#-1|right#1| raw: 300
    std_dev#0.1|     train: unprocessed(700)
                          : square_both(600)
  Param set #1:
    left#0|right#1| raw: 1000
    std_dev#0.5|
```
### Accessing the samples
To access the samples within Python, index into a `Cache` object:
```python
from slurm_gen import Cache

# load those 700 samples as a training set
X, y = Cache("./example/")["noisy_sine"][0]["train"].get(700)
```
## Object hierarchy
You saw in the example above that we accessed the data by indexing into the `Cache` object. Here we describe the object hierarchy used by SLURM_gen within the Python environment.
```
Cache
'- Dataset
   '- ParamSet
      '- Group
         :- raw data, accessed with `.get()`
         '- PreprocessedData
            '- preprocessed samples, accessed with `.get()`
```
Datasets can be indexed from a `Cache` object in the following ways (a combined sketch follows the list):

- by name, with a string (e.g. `"noisy_sine"`)
- by number, in order of declaration (e.g. `0`)
- by the actual dataset imported from `datasets.py` (e.g. `from datasets import noisy_sine; Cache()[noisy_sine]`)
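Put together, a minimal sketch of the three indexing styles, assuming the `example/` layout from above:

```python
from slurm_gen import Cache
from datasets import noisy_sine

cache = Cache("./example/")
ds = cache["noisy_sine"]  # by name
ds = cache[0]             # by order of declaration
ds = cache[noisy_sine]    # by the imported dataset object
```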
ParamSets can be indexed from a `Dataset` object in the following ways (see the sketch after the list):

- by number, as printed by `python -m slurm_gen.list`
- by the string used as the directory name for the parameter set (e.g. `"left#-1|right#1|std_dev#0.1"`)
- by the dict of parameter values (e.g. `{"left": -1, "right": 1, "std_dev": 0.1}`)
- by a `DefaultParamObject` (e.g. `NoisySineParams(left=-1, right=1, std_dev=0.1)`)
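Continuing the sketch above, the four styles applied to param set #0 from the listing:

```python
from datasets import NoisySineParams

ps = ds[0]                                               # by number, as printed by slurm_gen.list
ps = ds["left#-1|right#1|std_dev#0.1"]                   # by the directory-name string
ps = ds[{"left": -1, "right": 1, "std_dev": 0.1}]        # by a dict of parameter values
ps = ds[NoisySineParams(left=-1, right=1, std_dev=0.1)]  # by a DefaultParamObject
```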
Groups can be indexed from a `ParamSet` object only by the string used as the name of the group (e.g. `"train"`). The `Group` object has a `.get()` method that returns the unprocessed samples associated with the group. By default it returns all of them, but you can specify how many you want as a parameter.
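Continuing the sketch, `.get()` on a `Group` might look like this:

```python
group = ps["train"]
X, y = group.get()     # all unprocessed samples in the group
X, y = group.get(500)  # or a specific number of them
```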
`PreprocessedData` objects can be indexed from a `Group` object in the following ways:

- by preprocessor name (e.g. `"square_both"`)
- by the actual preprocessor imported from `datasets.py` (e.g. `from datasets import square_both; group[square_both]`)
The `PreprocessedData` object has a `.get()` method implementing the same functionality as that of `Group`.
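And the same pattern for preprocessed samples, again as a sketch:

```python
# the 600 samples preprocessed with square_both earlier in the example
X, y = group["square_both"].get(600)
# indexing by the imported function also works:
# from datasets import square_both; group[square_both].get(600)
```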
## TODO

- Track dataset sizes more efficiently.
- Support running preprocessing as a SLURM job.