.. _modifiers_page:

Modifiers
=========

.. contents:: Modifier types:
    :local:
    :depth: 1

Modifiers are used to preprocess data from the input files before it is
handed to the network. They are functions that are applied to the samples
as well as to the labels.
They require the input and output layers of the network, as well as every
input set in the toml list file, to be named. This makes it easy to ensure
that the right data is fed into the right layer of the network, especially
if there are multiple inputs or outputs.

All modifiers take a dict ``info_blob`` as input, which contains a subset of
the following keys:

**Possible keys in the info_blob dict**

``x_values`` : dict
    One batch of data from the ``cfg.key_x_values`` datagroup of the h5 file.
    Keys: Input set names from the toml list file.
    Values: Numpy array with x values from the respective file.
    If the datagroup is an indexed dataset, this will be a tuple of
    numpy arrays instead, with [0] being the values, and [1] being the
    number of items per sample.
``y_values`` : ndarray
    One batch of data from the ``cfg.key_y_values`` datagroup of the h5 file.
    If the content of the datagroup is a structured array, this will also
    be a structured array.
``phase`` : str
    Current phase the network is used in. Either 'training', 'validation'
    or 'inference'. Can be used to give modifiers different behaviours
    depending on the phase.
``xs`` : dict
    One batch of data, resulting from applying the sample modifier on
    ``x_values``.
    Keys: Name of an input layer of the network.
    Values: Numpy array with samples for the respective layer.
``ys`` : dict
    One batch of data, resulting from applying the label modifier on
    ``y_values``, i.e. the true values the model will try to reproduce.
    Keys: The names of the output layers of the model.
    Values: One batch of labels as a numpy array.
``y_pred`` : dict
    One batch of data, resulting from applying the model on ``xs``, i.e.
    the model prediction for this batch of input data.
    Keys: The names of the output layers of the model.
    Values: One batch of predictions from the respective output layer of
    the model as a numpy array.

For each modifier, a different subset of these entries will be available in
the ``info_blob``. See below for which keys are accessible for which
modifiers.

**Hint:** If you have come up with a new modifier, it is a good idea to test
whether it actually does what it should with the data. You can get an
``info_blob`` containing ``x_values`` and ``y_values`` from the files in
your toml list (i.e. before any modifiers have been applied) like this:

.. code-block:: python

    from orcanet.core import Organizer

    orga = Organizer(output_folder, list_file)
    info_blob = orga.io.get_batch()

This is exactly what is fed into your modifier when OrcaNet is run, so
testing your new modifier on it lets you make sure that it works.


Label modifier
--------------

The label modifier is used to generate the labels for the model from
the ``y_values`` data of the h5 input files.
Unless the label for the model is directly stored in the h5 files,
the definition of a label modifier is mandatory.
It must be of the following form:

.. code-block:: python

    def my_label_modifier(info_blob):
        ...
        return ys

**Contents of info_blob**: ``x_values``, ``y_values``, ``xs``

**Returns**: ``ys`` : dict (see above)

It can be set via

.. code-block:: python

    orga.cfg.label_modifier = my_label_modifier

**Hint:** If no label modifier is given, the names of the output layers of
the model have to appear as names of the dtypes in the ``y_values``
recarray. Then, each output layer will get data from the matching dataset.

Example
^^^^^^^

Assume that we are using this simple classification model with one output,
which is supplied with two different projections of our data at the same
time (XY and ZY):

.. code-block:: python

    # Keras imports (here taken from tensorflow.keras)
    from tensorflow.keras.layers import Concatenate, Dense, Input
    from tensorflow.keras.models import Model

    inp_1 = Input((1,), name="input_layer_xy")
    inp_2 = Input((1,), name="input_layer_zy")
    x = Concatenate()([inp_1, inp_2])
    output = Dense(2, name="classification")(x)
    example_model = Model((inp_1, inp_2), output)

The output will be either [1,0] or [0,1] (one hot encoding), depending on
whether the event is a neutrino or not.
Suppose that in the mc_info of the input file, one of the fields has the
name ``particle``, which is an int that is 1 for neutrinos and some other
number for non-neutrinos.
We need to convert this to the categorical output of the model with a
label modifier:

.. code-block:: python

    import numpy as np

    def label_modifier(info_blob):
        y_values = info_blob["y_values"]
        particle = y_values["particle"]

        # Create the label array for the output layer of shape (batchsize, 2)
        ntr_cat = np.zeros(particle.shape + (2, ))
        # If particle is 1, it's a neutrino, so we want to have [1,0]
        ntr_cat[:, 0] = particle == 1
        # If particle is not 1, we want [0,1]
        ntr_cat[:, 1] = particle != 1

        # Make a dict to get the label to the correct output layer;
        # the output layer is called "classification" in this model
        ys = dict()
        ys["classification"] = ntr_cat

        return ys


Sample modifier
---------------

The sample modifier function is applied to the ``x_values`` dict before it
is fed into the input layers of the network.
It must be of the following form:

.. code-block:: python

    def my_sample_modifier(info_blob):
        ...
        return xs

**Contents of info_blob**: ``x_values``, ``y_values``

**Returns**: ``xs`` : dict (see above)

It can be set via

.. code-block:: python

    orga.cfg.sample_modifier = my_sample_modifier

**Hint:** If no sample modifier is given, the names of the input sets in
the toml list file (= the keys of ``x_values``) and the names of the input
layers of the model have to be identical. Then, each input layer will get
data from the toml input set with the same name.

Example
^^^^^^^

Using the example classification model from above, assume that we have
input files with data in XY- and in YZ-projections.
In that case, the content of the toml list file could look like this::

    [xy]
    train_files = [
        "data/xy_train.h5",
    ]
    validation_files = [
        "data/xy_val.h5"
    ]

    [yz]
    train_files = [
        "data/yz_train.h5",
    ]
    validation_files = [
        "data/yz_val.h5"
    ]

Let's say we want to feed the network XY- and ZY-projections instead, i.e.
the axes of the YZ-projection need to be swapped.
The following sample modifier will perform this operation:

.. code-block:: python

    import numpy as np

    def sample_modifier(info_blob):
        x_values = info_blob["x_values"]

        xs = dict()
        xs["input_layer_xy"] = x_values["xy"]

        yz_data = x_values["yz"]
        # Axis 0 is the batchsize!
        xs["input_layer_zy"] = np.swapaxes(yz_data, 1, 2)

        return xs


Dataset modifier
----------------

The dataset modifier is only used when a model is evaluated
with ``organizer.predict``.
It determines what is written to the resulting prediction h5 file.
It must be of the following form:

.. code-block:: python

    def my_dataset_modifier(info_blob):
        ...
        return datasets

**Contents of info_blob**: ``y_values``, ``xs``, ``ys``, ``y_pred``

**Returns**: ``datasets`` : dict
    The datasets which will be created in the resulting h5 prediction file.
    Keys: Names of the datasets.
    Values: The content of each dataset as a numpy array.

It can be set via

.. code-block:: python

    orga.cfg.dataset_modifier = my_dataset_modifier

**Hint:** If no dataset modifier is given, the following datasets will be
created: ``y_values``, and two sets for every output layer (label and pred).
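As an illustration, a dataset modifier for the example classification model
from above might look like the following minimal sketch. The dataset names
(``particle``, ``true_is_neutrino``, ``pred_is_neutrino``) are hypothetical
and chosen only for this example, and the ``info_blob`` at the bottom is
built by hand from synthetic numbers to show the expected structure; it is
not produced by OrcaNet.

```python
import numpy as np


def dataset_modifier(info_blob):
    # Structured array with the raw labels from the h5 file
    y_values = info_blob["y_values"]
    # Dicts keyed by the output layer name ("classification" in the example model)
    ys = info_blob["ys"]
    y_pred = info_blob["y_pred"]

    # Hypothetical dataset names; each value is one numpy array per batch
    datasets = dict()
    datasets["particle"] = y_values["particle"]
    # Column 0 of the one hot encoding is the neutrino class
    datasets["true_is_neutrino"] = ys["classification"][:, 0]
    datasets["pred_is_neutrino"] = y_pred["classification"][:, 0]
    return datasets


# Hand-built, synthetic info_blob with a batch of three events (assumption:
# same key layout as described in the "Possible keys" list)
y_values = np.array([(1,), (13,), (1,)], dtype=[("particle", "<i8")])
ys = {"classification": np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])}
y_pred = {"classification": np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])}
info_blob = {"y_values": y_values, "xs": {}, "ys": ys, "y_pred": y_pred}

datasets = dataset_modifier(info_blob)
```

Calling the modifier on a small hand-built ``info_blob`` like this is a quick
way to check the names and shapes of the returned datasets before running a
full prediction.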