Modifiers

Modifiers are used to preprocess data from the input files before it is handed to the network. They are functions which are applied to both the samples and the labels. They require the input and output layers of the network, as well as every input set in the toml list file, to be named. This makes it easy to ensure that the right data is fed into the right layer of the network, especially if there are multiple inputs or outputs.

All modifiers take a dict info_blob as input, which contains a subset of the following keys:

Possible keys in the info_blob dict
x_values : dict

One batch of data from the cfg.key_x_values datagroup of the h5 file.

Keys: Input set names from the toml list file.

Values: Numpy array with x values from the respective file. If the datagroup is an indexed dataset, this will instead be a tuple of numpy arrays, with [0] being the values and [1] being the number of items per sample.

y_values : ndarray

One batch of data from the cfg.key_y_values datagroup of the h5 file. If the content of the datagroup is a structured array, this will also be a structured array.

phase : str

Current phase the network is used in. Either 'training', 'validation' or 'inference'. Can be used to give modifiers different behaviours depending on the phase.

xs : dict

One batch of data, resulting from applying the sample modifier to x_values.

Keys: Name of an input layer of the network.

Values: Numpy array with samples for the respective layer.

ys : dict

One batch of data, resulting from applying the label modifier to y_values, i.e. the true values the model will try to reproduce.

Keys: The names of the output layers of the model.

Values: One batch of labels as a numpy array.

y_pred : dict

One batch of data, resulting from applying the model to xs, i.e. the model prediction for this batch of input data.

Keys: The names of the output layers of the model.

Values: One batch of predictions from the respective output layer of the model as a numpy array.

For each modifier, a different subset of these entries will be available in the info_blob. See below for which keys are accessible for which modifiers.
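To give a rough idea of the structure, an info_blob assembled by hand (e.g. for testing a modifier) could look like this. The input set name "xy", the field name "particle" and all shapes below are just placeholders:

import numpy as np

# Hand-made info_blob with placeholder data
info_blob = {
    "x_values": {"xy": np.zeros((64, 11, 13, 18))},           # one batch of 64 samples
    "y_values": np.zeros((64,), dtype=[("particle", "i4")]),  # structured array
    "phase": "training",
}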

Hint: If you have come up with a new modifier, it might be smart to test whether it actually does what it should with the data. You can get an info_blob containing x_values and y_values from the files in your toml list (i.e. before any modifiers have been applied) like this:

from orcanet.core import Organizer

# output_folder and list_file are the paths to your output directory
# and your toml list file
orga = Organizer(output_folder, list_file)
# Fetch one batch of x_values and y_values, as a modifier would receive them
info_blob = orga.io.get_batch()

This is exactly what is fed into your modifiers when OrcaNet is run, so applying your new modifier to this info_blob lets you make sure it works as intended.
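For example, once you have defined a sample modifier (like my_sample_modifier described below), you can apply it to this info_blob by hand and inspect the result:

# Apply the modifier manually and check the produced batches
xs = my_sample_modifier(info_blob)
for layer_name, batch in xs.items():
    print(layer_name, batch.shape, batch.dtype)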

Label modifier

The label modifier is used to generate the labels for the model from the y_values data of the h5 input files. Unless the labels for the model are stored directly in the h5 files, defining a label modifier is mandatory.

It must be of the following form:

def my_label_modifier(info_blob):
    ...
    return ys

Contents of info_blob:

x_values, y_values, xs

Returns:

ys : dict (see above)

It can be set via

orga.cfg.label_modifier = my_label_modifier

Hint: If no label modifier is given, the names of the output layers of the model have to appear as field names of the y_values recarray. Then, each output layer will get the data from the field with the matching name.
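As an illustration (the field and layer names below are just assumptions): if y_values has a field "energy" and the model has an output layer named "energy", that layer is trained directly on this field, with no label modifier needed:

import numpy as np

# Hypothetical y_values with a field named like an output layer of the model
y_values = np.zeros((64,), dtype=[("energy", "f4"), ("particle", "i4")])
# With no label modifier set, an output layer named "energy" would
# automatically be trained on y_values["energy"].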

Example

Assume that we are using this simple classification model with one output, which is supplied with two different projections of our data at the same time (XY and ZY):

# assuming a tf.keras model; adapt the imports to your keras setup
from tensorflow.keras.layers import Input, Concatenate, Dense
from tensorflow.keras.models import Model

inp_1 = Input((1,), name="input_layer_xy")
inp_2 = Input((1,), name="input_layer_zy")

x = Concatenate()([inp_1, inp_2])

output = Dense(2, name="classification")(x)

example_model = Model((inp_1, inp_2), output)

The output will be either [1, 0] or [0, 1] (one-hot encoding), depending on whether the event is a neutrino or not. Suppose that in the mc_info of the input file, one of the fields is named particle, an int which is 1 for neutrinos and some other number for non-neutrinos. We need to convert this to the categorical output of the model with a label modifier:

import numpy as np

def label_modifier(info_blob):
    y_values = info_blob["y_values"]
    particle = y_values["particle"]
    # Create the label array for the output layer, of shape (batchsize, 2)
    ntr_cat = np.zeros(particle.shape + (2, ))
    # If particle is 1, it's a neutrino, so we want [1, 0]
    ntr_cat[:, 0] = particle == 1
    # If particle is not 1, we want [0, 1]
    ntr_cat[:, 1] = particle != 1
    # Make a dict to route the label to the correct output layer;
    # the output layer is called "classification" in this model
    ys = dict()
    ys["classification"] = ntr_cat
    return ys
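To check that the modifier behaves as expected, it can be applied to a hand-made y_values (a small sketch; the values are arbitrary):

import numpy as np

# Fake y_values: a structured array with a "particle" field (1 = neutrino)
fake_y_values = np.array([(1,), (13,), (1,)], dtype=[("particle", "i4")])
ys = label_modifier({"y_values": fake_y_values})
print(ys["classification"])  # -> [[1, 0], [0, 1], [1, 0]]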

Sample modifier

The sample modifier function is applied to the x_values dict and produces the samples that are fed into the input layers of the network. It must be of the following form:

def my_sample_modifier(info_blob):
    ...
    return xs

Contents of info_blob:

x_values, y_values

Returns:

xs : dict (see above)

It can be set via

orga.cfg.sample_modifier = my_sample_modifier

Hint: If no sample modifier is given, the names of the input sets in the toml list file (= the keys of x_values) and the names of the input layers of the model have to be identical. Then, each input layer will get data from the toml input set with the same name.
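For example (purely illustrative): with a toml input set called [images], a model whose input layer is also named "images" needs no sample modifier:

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Input layer named like the hypothetical toml input set "images",
# so its data is routed to this layer automatically
inp = Input((10,), name="images")
model = Model(inp, Dense(2)(inp))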

Example

Using the example classification model from above, assume that we have input files with data in XY- and in YZ-projections. In that case, the content of the toml list file could look like this:

[xy]
train_files = [
"data/xy_train.h5",
]

validation_files = [
"data/xy_val.h5"
]

[yz]
train_files = [
"data/yz_train.h5",
]

validation_files = [
"data/yz_val.h5"
]

Let’s say we want to feed the network XY- and ZY-projections instead, i.e. the axes of the YZ-projection need to be swapped. The following sample modifier will perform this operation:

import numpy as np

def sample_modifier(info_blob):
    x_values = info_blob["x_values"]
    xs = dict()

    xs["input_layer_xy"] = x_values["xy"]

    yz_data = x_values["yz"]
    xs["input_layer_zy"] = np.swapaxes(yz_data, 1, 2)  # Axis 0 is the batchsize!

    return xs
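Again, this can be checked by hand with dummy arrays (the shapes below are just placeholders):

import numpy as np

# Dummy batch of 4 samples with made-up projection shapes
fake_x_values = {"xy": np.zeros((4, 5, 6)), "yz": np.zeros((4, 7, 8))}
xs = sample_modifier({"x_values": fake_x_values})
print(xs["input_layer_zy"].shape)  # (4, 8, 7): the y and z axes got swapped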

Dataset modifier

The dataset modifier is only used when a model is evaluated with organizer.predict. It determines what is written to the resulting prediction h5 file. It must be of the following form:

def my_dataset_modifier(info_blob):
    ...
    return datasets

Contents of info_blob:

y_values, xs, ys, y_pred

Returns:

datasets : dict

The datasets which will be created in the resulting h5 prediction file.

Keys: Names of the datasets.

Values: The content of each dataset as a numpy array.

It can be set via

orga.cfg.dataset_modifier = my_dataset_modifier

Hint: If no dataset modifier is given, the following datasets will be created: y_values, plus two datasets for every output layer of the model (one with the labels, one with the predictions).
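For illustration, a minimal dataset modifier for the example classification model above could look like this (the dataset names are arbitrary):

def dataset_modifier(info_blob):
    datasets = dict()
    # Keep the full y_values so the mc info ends up in the prediction file
    datasets["y_values"] = info_blob["y_values"]
    # Store labels and predictions of the "classification" output layer
    datasets["label_classification"] = info_blob["ys"]["classification"]
    datasets["pred_classification"] = info_blob["y_pred"]["classification"]
    return datasets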