Modifiers
Modifiers are used to preprocess data from the input files before handing it to the network. They are functions that are applied to the samples as well as to the labels. They require the input and output layers of the network, as well as every input set in the toml list file, to be named. This makes it easy to ensure that the right data is fed into the right layer of the network, especially if there are multiple inputs or outputs.
All modifiers take a dict info_blob as input, which contains a subset of the following keys:

Possible keys in the info_blob dict:

- x_values (dict): One batch of data from the cfg.key_x_values datagroup of the h5 file.
  Keys: Input set names from the toml list file.
  Values: Numpy array with x values from the respective file. If the datagroup is an indexed dataset, this will be a tuple of numpy arrays instead, with [0] being the values, and [1] being the number of items per sample.
- y_values (ndarray): One batch of data from the cfg.key_y_values datagroup of the h5 file. If the content of the datagroup is a structured array, this will also be a structured array.
- phase (str): Current phase the network is used in. Either ‘training’, ‘validation’ or ‘inference’. Can be used to have modifiers with different behaviours depending on the phase.
- xs (dict): One batch of data, resulting from applying the sample modifier on x_values.
  Keys: Name of an input layer of the network.
  Values: Numpy array with samples for the respective layer.
- ys (dict): One batch of data, resulting from applying the label modifier on y_values, aka the true values the model will try to reproduce.
  Keys: The names of the output layers of the model.
  Values: One batch of labels as a numpy array.
- y_pred (dict): One batch of data, resulting from applying the model on xs, aka the model prediction for this batch of input data.
  Keys: The names of the output layers of the model.
  Values: One batch of predictions from the respective output layer of the model as a numpy array.
For each modifier, a different subset of these entries will be available in the info_blob. See below for which keys are accessible for which modifiers.
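To make the structure of these entries more concrete, here is a hand-built stand-in for the part of an info_blob that holds x_values and y_values. The input set names, array shapes and field names are made up for illustration, and split_indexed is a hypothetical helper, not part of OrcaNet; it only shows one way to handle the tuple form described above for indexed datasets.

```python
import numpy as np

batchsize = 4

# Hypothetical stand-in for one batch read from the input files
info_blob = {
    # one entry per input set in the toml list file (names made up)
    "x_values": {
        "xy": np.zeros((batchsize, 11, 13)),
        "yz": np.zeros((batchsize, 13, 18)),
    },
    # structured array with one record per sample (field names made up)
    "y_values": np.zeros(batchsize, dtype=[("particle", "i4"), ("energy", "f4")]),
}


def split_indexed(x):
    """Return (values, n_items); n_items is None for a regular dataset."""
    if isinstance(x, tuple):
        values, n_items = x
        return values, n_items
    return x, None


for name, x in info_blob["x_values"].items():
    values, n_items = split_indexed(x)
    print(name, values.shape, n_items)
```

A modifier that has to work with both regular and indexed datasets could start with such a check before touching the values.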
Hint: If you have come up with a new modifier, it is a good idea to test whether it actually does what it should with the data. You can get an info_blob containing x_values and y_values from the files in your toml list (i.e. before any modifiers have been applied) like this:
from orcanet.core import Organizer
orga = Organizer(output_folder, list_file)
info_blob = orga.io.get_batch()
This will be exactly what is fed into your modifier when OrcaNet is run, so testing your new modifier on it lets you make sure that it works.
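One way to make such a test systematic is a small check on the dict a modifier returns. The sketch below is only an illustration: check_modifier_output is a hypothetical helper, not an OrcaNet function, and the layer name and array shape are made up. It verifies that the output is a dict, that its keys are the expected layer names, and that every array has the batch size on its first axis.

```python
import numpy as np


def check_modifier_output(out, expected_names, batchsize):
    """Hypothetical helper: sanity-check the dict returned by a modifier."""
    if not isinstance(out, dict):
        raise TypeError("a modifier has to return a dict")
    if set(out) != set(expected_names):
        raise ValueError(f"got keys {sorted(out)}, expected {sorted(expected_names)}")
    for name, array in out.items():
        if len(array) != batchsize:
            raise ValueError(f"{name}: first axis is {len(array)}, expected {batchsize}")


# Stand-in for the output of a sample modifier, built by hand
xs = {"input_layer_xy": np.zeros((4, 11, 13))}
check_modifier_output(xs, ["input_layer_xy"], batchsize=4)  # passes silently
```

In practice, out would be the result of calling your modifier on the info_blob obtained above, and expected_names the names of the input or output layers of your model.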
Label modifier
The label modifier is used to generate the labels for the model from the y_values data of the h5 input files. Unless the label for the model is directly stored in the h5 files, the definition of a label modifier is mandatory.
It must be of the following form:
def my_label_modifier(info_blob):
    ...
    return ys
Contents of info_blob: x_values, y_values, xs
Returns: ys: dict (see above)
It can be set via
orga.cfg.label_modifier = my_label_modifier
Hint: If no label modifier is given, the names of the output layers of the model have to appear as field names in the dtype of the y_values structured array. Then, each output layer will get data from the field with the matching name.
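The naming requirement in this hint can be illustrated with a hand-built structured array. The sketch below is not how OrcaNet implements the default internally; it only shows the idea of picking the fields whose names match the output layers. The field names and the layer name are made up.

```python
import numpy as np

# Hypothetical y_values batch: a structured array with two named fields
y_values = np.zeros(3, dtype=[("energy", "f4"), ("dir_x", "f4")])
y_values["energy"] = [1.0, 2.0, 3.0]

# Hypothetical model with a single output layer called "energy"
output_layer_names = ["energy"]

# Each output layer gets the field of the same name
ys = {name: y_values[name] for name in output_layer_names}
print(ys["energy"])
```

If an output layer name were missing from y_values.dtype.names, the lookup would fail, which is why a label modifier is needed in that case.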
Example
Assume that we are using this simple classification model with one output, which is supplied with two different projections of our data at the same time (XY and ZY):
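# Imports assume the Keras version bundled with TensorFlow
from tensorflow.keras.layers import Input, Concatenate, Dense
from tensorflow.keras.models import Model

inp_1 = Input((1,), name="input_layer_xy")
inp_2 = Input((1,), name="input_layer_zy")
x = Concatenate()([inp_1, inp_2])
output = Dense(2, name="classification")(x)
example_model = Model((inp_1, inp_2), output)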
The output will be either [1,0] or [0,1] (one-hot encoding), depending on whether the event is a neutrino or not. Suppose that in the mc_info of the input file, one of the fields has the name particle, which is an int that is 1 for neutrinos and some other number for non-neutrinos. We need to convert this to the categorical output of the model with a label modifier:
import numpy as np

def label_modifier(info_blob):
    y_values = info_blob["y_values"]
    particle = y_values["particle"]
    # Create the label array for the output layer of shape (batchsize, 2)
    ntr_cat = np.zeros(particle.shape + (2, ))
    # If particle is 1, it's a neutrino, so we want to have [1,0]
    ntr_cat[:, 0] = particle == 1
    # If particle is not 1, we want [0,1]
    ntr_cat[:, 1] = particle != 1
    # Make a dict to get the label to the correct output layer;
    # the output layer is called "classification" in this model
    ys = dict()
    ys["classification"] = ntr_cat
    return ys
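To see what this label modifier produces, it can be run on a small hand-made batch. In the sketch below the function is repeated in compact form so that the block runs on its own, and the particle values are invented for the test (1 stands for a neutrino, as above).

```python
import numpy as np


def label_modifier(info_blob):
    particle = info_blob["y_values"]["particle"]
    ntr_cat = np.zeros(particle.shape + (2,))
    ntr_cat[:, 0] = particle == 1  # neutrino -> [1, 0]
    ntr_cat[:, 1] = particle != 1  # anything else -> [0, 1]
    return {"classification": ntr_cat}


# Made-up batch of four events; only the "particle" field is needed here
y_values = np.zeros(4, dtype=[("particle", "i4")])
y_values["particle"] = [1, 13, 1, 2]

ys = label_modifier({"y_values": y_values})
print(ys["classification"])
```

The first and third rows come out as [1, 0], the other two as [0, 1], and the array has the shape (batchsize, 2) expected by the "classification" layer.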
Sample modifier
The sample modifier function is applied to the x_values dict before it is fed into the input layers of the network.
It must be of the following form:
def my_sample_modifier(info_blob):
    ...
    return xs
Contents of info_blob: x_values, y_values
Returns: xs: dict (see above)
It can be set via
orga.cfg.sample_modifier = my_sample_modifier
Hint: If no sample modifier is given, the names of the input sets in the toml list file (= the keys of x_values) and the names of the input layers of the model have to be identical. Then, each input layer will get data from the toml input set with the same name.
Example
Using the example classification model from above, assume that we have input files with data in XY- and in YZ-projections. In that case, the content of the toml list file could look like this:
[xy]
train_files = [
    "data/xy_train.h5",
]
validation_files = [
    "data/xy_val.h5"
]

[yz]
train_files = [
    "data/yz_train.h5",
]
validation_files = [
    "data/yz_val.h5"
]
Let’s say we want to feed the network XY- and ZY-projections instead, i.e. the axes of the YZ-projection need to be swapped. The following sample modifier will perform this operation:
import numpy as np

def sample_modifier(info_blob):
    x_values = info_blob["x_values"]
    xs = dict()
    # The XY-projection is passed on unchanged
    xs["input_layer_xy"] = x_values["xy"]
    # Swap the Y and Z axes of the YZ-projection. Axis 0 is the batchsize!
    yz_data = x_values["yz"]
    xs["input_layer_zy"] = np.swapaxes(yz_data, 1, 2)
    return xs
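The effect of the axis swap can be checked on a small hand-made batch. The sketch below repeats the sample modifier in compact form so that it runs on its own; the array shapes (two samples, 11 x 13 bins for XY, 13 x 18 bins for YZ) are invented for the test.

```python
import numpy as np


def sample_modifier(info_blob):
    x_values = info_blob["x_values"]
    return {
        "input_layer_xy": x_values["xy"],
        # axis 0 is the batch axis, so swap axes 1 and 2
        "input_layer_zy": np.swapaxes(x_values["yz"], 1, 2),
    }


# Made-up batch with two samples per input set
x_values = {
    "xy": np.zeros((2, 11, 13)),
    "yz": np.arange(2 * 13 * 18).reshape(2, 13, 18),
}

xs = sample_modifier({"x_values": x_values})
for name, array in xs.items():
    print(name, array.shape)
```

The ZY input ends up with the shape (2, 18, 13), and the entry at [sample, z, y] equals the original YZ entry at [sample, y, z].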
Dataset modifier
The dataset modifier is only used when a model is evaluated with organizer.predict. It determines what is written to the resulting prediction h5 file.
It must be of the following form:
def my_dataset_modifier(info_blob):
    ...
    return datasets
Contents of info_blob: y_values, xs, ys, y_pred
Returns: datasets: dict
The datasets which will be created in the resulting h5 prediction file.
Keys: Names of the datasets.
Values: The content of each dataset as a numpy array.
It can be set via
orga.cfg.dataset_modifier = my_dataset_modifier
Hint: If no dataset modifier is given, the following datasets will be created: y_values, and two sets for every output layer (label and pred).
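As an illustration of a custom dataset modifier, the sketch below stores only the predicted and the true class index for the "classification" output layer of the example model, instead of the full one-hot arrays. The dataset names pred_class and true_class are made up, and the info_blob is built by hand with invented numbers so that the block runs without OrcaNet.

```python
import numpy as np


def dataset_modifier(info_blob):
    y_pred = info_blob["y_pred"]["classification"]
    ys = info_blob["ys"]["classification"]
    datasets = {
        # index of the largest model output per sample
        "pred_class": np.argmax(y_pred, axis=-1),
        # index of the 1 in the one-hot label per sample
        "true_class": np.argmax(ys, axis=-1),
    }
    return datasets


# Hand-built stand-in for one batch during prediction
info_blob = {
    "y_pred": {"classification": np.array([[0.9, 0.1], [0.2, 0.8]])},
    "ys": {"classification": np.array([[1.0, 0.0], [1.0, 0.0]])},
}

datasets = dataset_modifier(info_blob)
print(datasets["pred_class"], datasets["true_class"])
```

Each key of the returned dict would become one dataset in the prediction file, with one entry per sample.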