Checks a given folder and its subfolders for duplicate files by calculating the SHA1 hash of each file.

class DoubleFileChecker

DoubleFileChecker(path, reverse=False)

Checks a given folder and its subfolders for duplicate files by calculating the SHA1 hash of each file.
path: the folder to process
reverse: if True, order the files in reverse

DoubleFileChecker.check

DoubleFileChecker.check()

The main validation logic.

Helper Methods

configure_logging

configure_logging(logging_level=20)

Configures logging for the system.

:param logging_level: The logging level to use.
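The default of 20 is the numeric value of logging.INFO in Python's standard logging module. A minimal sketch of a helper with this signature could look as follows; setup_logging and the format string are hypothetical and are not taken from mlcore:

```python
import logging


def setup_logging(logging_level=logging.INFO):
    """Hypothetical stand-in for configure_logging; not mlcore's actual code."""
    # basicConfig only installs a default handler if the root logger has none yet.
    logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
    # Set the level explicitly so that repeated calls still take effect.
    logging.getLogger().setLevel(logging_level)


setup_logging()               # INFO (20) and above
setup_logging(logging.DEBUG)  # more verbose: DEBUG (10) and above
```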

get_file_sha

get_file_sha(fname)

Calculates the SHA1 of a given file.
fname: the file path
returns: the calculated SHA1 as a hex string
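One common way to compute such a digest with the standard library is sketched below. The name sha1_hex and the chunk_size parameter are assumptions made for illustration; how get_file_sha reads the file internally is not documented here.

```python
import hashlib


def sha1_hex(fname, chunk_size=65536):
    """Return the SHA1 hex digest of the file at fname (illustrative helper)."""
    digest = hashlib.sha1()
    with open(fname, "rb") as f:
        # Read in fixed-size chunks so large files are never fully loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical content produce the same 40-character hex string, which is what makes the digest usable as a duplicate key.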

check_double_entries

check_double_entries(entries)

Processes a list of tuples of filenames and their corresponding hashes, looking for hashes that occur more than once.
entries: the list of entries with their hashes
returns: a dictionary containing the duplicate entries, keyed by hash
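The grouping step can be sketched as follows. This assumes each entry is a (filename, hash) tuple in that order and that the result maps each repeated hash to its list of filenames; both are assumptions, and group_duplicates is a hypothetical name rather than the library function.

```python
from collections import defaultdict


def group_duplicates(entries):
    """Map each hash that occurs more than once to the filenames sharing it."""
    by_hash = defaultdict(list)
    for fname, sha in entries:
        by_hash[sha].append(fname)
    # Keep only hashes shared by at least two files.
    return {sha: names for sha, names in by_hash.items() if len(names) > 1}


entries = [("a.png", "h1"), ("b.png", "h2"), ("c.png", "h1")]
print(group_duplicates(entries))  # {'h1': ['a.png', 'c.png']}
```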

remove_entries

remove_entries(entries, to_remove, delete_source=False)

Removes entries from a list and optionally removes the source files as well.
entries: the list of entries to remove from
to_remove: the list of entries to remove
delete_source: whether or not to delete the source file as well
returns: a list of the resulting entries
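A minimal sketch of this behaviour, assuming that entries and to_remove are plain lists of file paths (the real entry type may differ). drop_entries is a hypothetical name used only for this illustration.

```python
import os


def drop_entries(entries, to_remove, delete_source=False):
    """Return entries without the items in to_remove; optionally delete the files."""
    removal = set(to_remove)
    remaining = [entry for entry in entries if entry not in removal]
    if delete_source:
        for path in removal:
            # Skip paths that are already gone instead of raising an error.
            if os.path.isfile(path):
                os.remove(path)
    return remaining
```

Since delete_source=True removes files from disk irreversibly, it is worth running once with the default False and inspecting the returned list first.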

scan_folder

scan_folder(folder)

Scans a folder and its subfolders for image content.
folder: the folder to scan
returns: a list of paths to the images found
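A recursive scan of this kind can be sketched with os.walk. Which files count as images is not stated here, so the extension set below is an assumption, as is the name find_images.

```python
import os

# Assumed set of image extensions; the library's actual filter may differ.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def find_images(folder):
    """Return sorted paths of image files found in folder and its subfolders."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            # Compare extensions case-insensitively so "B.PNG" also matches.
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                found.append(os.path.join(root, name))
    return sorted(found)
```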

Run from command line

To run the duplicate image check from the command line, use the following command:

python -m mlcore.tools.check_double_images [parameters]

The following parameters are supported:

  • [folder]: The folder to scan.
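For illustration, a single positional folder argument like the one listed above could be parsed with argparse as sketched here. This is not the module's actual argument handling; build_parser, the prog name, and the example path are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with one positional argument (illustrative only)."""
    parser = argparse.ArgumentParser(prog="check_double_images")
    parser.add_argument("folder", help="The folder to scan.")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["/data/images"])
print(args.folder)  # /data/images
```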