:py:mod:`orcanet.h5_generator`
==============================

.. py:module:: orcanet.h5_generator


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   orcanet.h5_generator.Hdf5BatchGenerator


Functions
~~~~~~~~~

.. autoapisummary::

   orcanet.h5_generator.get_h5_generator
   orcanet.h5_generator.make_dataset


.. py:class:: Hdf5BatchGenerator(files_dict, batchsize=64, key_x_values='x', key_y_values='y', sample_modifier=None, label_modifier=None, fixed_batchsize=False, y_field_names=None, phase='training', xs_mean=None, f_size=None, keras_mode=True, shuffle=False, class_weights=None)

   Base object for fitting to a sequence of data, such as a dataset.

   Every `Sequence` must implement the `__getitem__` and the `__len__` methods.
   If you want to modify your dataset between epochs you may implement
   `on_epoch_end`. The method `__getitem__` should return a complete batch.

   Notes:

   `Sequence` objects are a safer way to do multiprocessing. This structure
   guarantees that the network will only train once on each sample per epoch,
   which is not the case with generators.

   Examples:

   ```python
   from skimage.io import imread
   from skimage.transform import resize
   from tensorflow.keras.utils import Sequence
   import numpy as np
   import math

   # Here, `x_set` is a list of paths to the images
   # and `y_set` are the associated classes.

   class CIFAR10Sequence(Sequence):

       def __init__(self, x_set, y_set, batch_size):
           self.x, self.y = x_set, y_set
           self.batch_size = batch_size

       def __len__(self):
           # Number of batches per epoch; the last batch may be smaller.
           return math.ceil(len(self.x) / self.batch_size)

       def __getitem__(self, idx):
           batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
           batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]

           # Load and resize every image of the batch.
           return np.array([
               resize(imread(file_name), (200, 200))
               for file_name in batch_x]), np.array(batch_y)
   ```

   ..
       !! processed by numpydoc !!

   .. py:method:: pad_to_size(info_blob)

      Pad the batch to have a fixed batchsize.

      ..
          !! processed by numpydoc !!

   .. py:method:: open()

      Open all files and prepare for read out.

      ..
          !! processed by numpydoc !!

   .. py:method:: close()

      Close all files again.

      ..
          !! processed by numpydoc !!

   .. py:method:: get_x_values(start_index)

      Read one batch of samples from the files and zero center.

      :Parameters:

          **start_index** : int
              The start index in the h5 files at which the batch will be read.
              The end index will be the start index + the batch size.

      :Returns:

          **x_values** : dict
              One batch of data for each input file.

      ..
          !! processed by numpydoc !!

   .. py:method:: get_y_values(start_index)

      Get y_values for the nn.

      Since the y_values are hopefully the same for all the files,
      use the ones from the first. TODO add check

      :Parameters:

          **start_index** : int
              The start index in the h5 files at which the batch will be read.
              The end index will be the start index + the batch size.

      :Returns:

          **y_values** : ndarray
              The y_values, right from the files.

      ..
          !! processed by numpydoc !!

   .. py:method:: print_timestats(print_func=None)

      Print stats about how long it took to read batches.

      ..
          !! processed by numpydoc !!

   .. py:method:: get_file_meta()

      Meta information about the files. Only read out once.

      ..
          !! processed by numpydoc !!


.. py:function:: get_h5_generator(orga, files_dict, f_size=None, zero_center=False, keras_mode=True, shuffle=False, use_def_label=True, phase='training')

   Initialize the hdf5_batch_generator_base with the parameters in orga.cfg.

   :Parameters:

       **orga** : orcanet.core.Organizer
           Contains all the configurable options in the OrcaNet scripts.

       **files_dict** : dict
           Paths of the files to train on.
           Keys: The name of every input (from the toml list file, can be multiple).
           Values: The filepath of a single h5py file to read samples from.

       **f_size** : int or None
           Specifies the number of samples to be read from the .h5 file.
           If None, the whole .h5 file will be used.

       **zero_center** : bool
           Whether to use zero centering.
           Requires orga.zero_center_folder to be set.

       **keras_mode** : bool
           Specifies if mc-infos (y_values) should be yielded as well. The
           mc-infos are used for evaluation after training and testing is finished.

       **shuffle** : bool
           Randomize the order in which batches are read from the file.
           Significantly reduces read out speed.

       **use_def_label** : bool
           If True and no label modifier is given by the user, use the default
           label modifier instead of none.

   :Yields:

       **xs** : dict
           Data for the model to train on.

           Keys : str
               The name(s) of the input layer(s) of the model.
           Values : ndarray
               A batch of samples for the corresponding input.

       **ys** : dict or None
           Labels for the model to train on.

           Keys : str
               The name(s) of the output layer(s) of the model.
           Values : ndarray
               A batch of labels for the corresponding output.

           Will be None if there are no labels in the file.

       **y_values** : ndarray, optional
           Y values from the file. Only yielded if yield_mc_info is True.

   ..
       !! processed by numpydoc !!

.. py:function:: make_dataset(gen)
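

The indexing convention described for ``get_x_values`` and ``get_y_values`` (a batch spans ``start_index`` to ``start_index + batchsize``) and the purpose of ``pad_to_size`` can be sketched in plain Python. This is only an illustration, not OrcaNet's implementation: the helper names ``batch_bounds`` and ``n_missing`` are hypothetical, and clipping the last batch at the end of the file is an assumption.

```python
import math


def batch_bounds(n_samples, batchsize=64):
    """Return (start, end) index pairs that cover n_samples in file order.

    Hypothetical helper, not part of orcanet. The end index is
    start + batchsize, clipped (by assumption) to the number of samples.
    """
    n_batches = math.ceil(n_samples / batchsize)
    return [
        (i * batchsize, min((i + 1) * batchsize, n_samples))
        for i in range(n_batches)
    ]


def n_missing(start, end, batchsize=64):
    """Number of samples a batch lacks to reach the fixed batch size."""
    return batchsize - (end - start)


# Example: a file with 150 samples, read in batches of 64.
bounds = batch_bounds(n_samples=150, batchsize=64)
print(bounds)
print(n_missing(*bounds[-1], batchsize=64))
```

With 150 samples and a batch size of 64, the last batch holds only 22 samples, so 42 entries would have to be padded if a fixed batch size is required.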