Dataset generator Notes.

Creates a dataset for a classification or segmentation task from an imageset. If an annotation file is present, the annotations are prepared as well.


Imagesets are collections of images from which a dataset is built. They are stored in the imagesets base folder, which has the following folder structure:

  • imagesets/[imageset_type]/[imageset_name]

The [imageset_name] folder contains the following files and folders:

  • test/: test images (benchmark)
  • trainval/: training and validation images for cross validation
  • categories.txt: all categories (classes) the imageset contains
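
The layout above can be checked before a build. The following is a minimal sketch, assuming only the three entries listed above; the helper name missing_imageset_entries is hypothetical and not part of mlcore.

```python
from pathlib import Path

# Entries these notes list for every imageset folder.
REQUIRED_ENTRIES = ("test", "trainval", "categories.txt")


def missing_imageset_entries(imageset_dir):
    """Return the required entries that are absent from imageset_dir.

    Hypothetical helper for illustration; it is not part of mlcore.
    """
    root = Path(imageset_dir)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]
```

An empty list means the folder matches the documented layout, for example when called with imagesets/segmentation/car_damage.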

Dataset Folders

Datasets are stored in the datasets base folder. The datasets folder contains the following folder structure:

  • datasets/[dataset_type]/[dataset_name] where [dataset_type] is the same as the corresponding [imageset_type] and [dataset_name] is the same as the corresponding [imageset_name].

The [dataset_name] folder contains the following files and folders:

  • test/: test set (benchmark)
  • train/: training set
  • val/: validation set
  • categories.txt: all categories (classes) the dataset contains
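
Because [dataset_type] and [dataset_name] mirror the imageset, the dataset folder can be derived from the imageset folder. The following is a minimal sketch of that mapping; the helper name dataset_dir_for and the default root "datasets" are assumptions for illustration, not mlcore's API.

```python
from pathlib import Path


def dataset_dir_for(imageset_dir, datasets_root="datasets"):
    """Map imagesets/[type]/[name] to datasets/[type]/[name].

    Hypothetical helper for illustration; it is not part of mlcore.
    """
    imageset_dir = Path(imageset_dir)
    # The last two path components are [imageset_type] and [imageset_name].
    imageset_type, imageset_name = imageset_dir.parts[-2:]
    return Path(datasets_root) / imageset_type / imageset_name
```

The sketch expects a path with at least two components and raises a ValueError otherwise.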

Helper Methods



Configures logging for the system.

Build a data-set

Builds a dataset from an imageset. Classification and segmentation imagesets are currently supported. The imageset type is taken from the name of the parent folder that contains the imageset folder.
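
Taking the type from the parent folder can be sketched as follows. This is an illustration under the assumption that only the two types named above are valid; infer_imageset_type is a hypothetical name, and the actual builder may validate differently.

```python
from pathlib import Path

# The two imageset types these notes describe as supported.
SUPPORTED_TYPES = ("classification", "segmentation")


def infer_imageset_type(imageset_dir):
    """Infer the imageset type from the name of the parent folder.

    Hypothetical helper for illustration; it is not part of mlcore.
    """
    imageset_type = Path(imageset_dir).parent.name
    if imageset_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported imageset type: {imageset_type!r}")
    return imageset_type
```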


generate(dataset:Dataset, log_memory_handler)

Generate a dataset.

  • dataset: the dataset to build
  • log_memory_handler: the log handler for the build log

Run from command line

To run the dataset builder from the command line, use the following command: python -m mlcore.dataset [parameters]
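
As an illustration, a full invocation can be assembled in Python. This sketch only builds and prints the command; the paths are the example paths used in the parameter list in these notes, and the seed value 42 is an arbitrary assumption.

```python
import shlex

# Build the command as an argument list. The paths are the examples used in
# these notes; the --split and --seed values are illustrative.
command = [
    "python", "-m", "mlcore.dataset",
    "imagesets/segmentation/car_damage/categories.txt",
    "--annotation", "imagesets/segmentation/car_damage/via_region_data.json",
    "--split", "0.2",
    "--seed", "42",
]

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(command))
```

To run the command from Python instead of printing it, the list can be passed to subprocess.run(command, check=True).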

The following parameters are supported:

  • [categories]: The path to the categories file. (e.g.: imagesets/segmentation/car_damage/categories.txt)
  • --annotation: The path to the imageset annotation file from which the dataset is built. (e.g.: imagesets/classification/car_damage/annotations.csv for classification, imagesets/segmentation/car_damage/via_region_data.json for segmentation)
  • --split: The fraction of the data that belongs to the validation set, defaults to 0.2 (=20%).
  • --seed: A random seed to make splits reproducible, defaults to None.
  • --category-label-key: The key under which the category name is found in the annotation file, defaults to category.
  • --sample: The percentage of the data that is copied as a sample set into a separate folder with a "_sample" suffix. If not set, no sample dataset is created.
  • --type: The type of the dataset. If not set explicitly, it is inferred from the categories file path.
  • --tfrecord: Also create .tfrecord files.
  • --join-overlapping-regions: Whether overlapping regions of same category should be joined.
  • --annotation-area-thresh: Keep only annotations that reach a minimum size (width or height) relative to the image size.
  • --output: The path of the dataset folder, defaults to ../datasets.
  • --name: The name of the dataset. If not set explicitly, it is inferred from the categories file path.
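
How --split and --seed interact can be sketched with a seeded shuffle. This is a minimal illustration, not mlcore's implementation: the helper name split_trainval is hypothetical, and the real builder may shuffle or round differently.

```python
import random


def split_trainval(files, split=0.2, seed=None):
    """Split files into (train, val); `split` is the validation fraction.

    Hypothetical helper for illustration; it is not part of mlcore.
    """
    # Sort first so that the seed alone determines the resulting order.
    files = sorted(files)
    random.Random(seed).shuffle(files)
    val_count = int(len(files) * split)
    return files[val_count:], files[:val_count]
```

With a fixed seed, repeated calls on the same file names return the same train and validation lists; with seed=None, each call may produce a different split.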