From 75c95a271a45aaf9c0c74b7d97a7e3aaf4e8e3e1 Mon Sep 17 00:00:00 2001
From: lorenzomassimiani
Date: Tue, 15 Oct 2024 14:56:45 +0200
Subject: [PATCH] chore: README update.

---
 README.md | 199 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 197 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index bf427b4..454c253 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,197 @@
-# ufoid
-Ultra Fast Optimized Image Deduplication.
+
+![ImmoLogo](.github/example_images/ImmobiliareLabs_Logo_Negative.png)
+
+# UFOID
+
+> Ultra Fast Optimized Image Deduplication.
+
+![Test](https://github.com/immobiliare/ufoid/actions/workflows/ci.yaml/badge.svg)
+![Python 3.9](https://img.shields.io/badge/Python-3.9+-blue)
+
+## Table of Contents
+
+- [Introduction](#introduction)
+- [Installation](#installation)
+- [Configuration](#configuration)
+- [Results](#results)
+- [Contributing](#contributing)
+- [License](#license)
+
+## Introduction
+
+The goal of this project is to efficiently detect and handle duplicate images within a dataset and across different
+datasets. The code uses perceptual hashing (image hashing) to convert images into hash representations, allowing for
+quick comparison and identification of duplicate images based on a specified distance threshold.
+
+The project provides two main functionalities:
+
+1. Duplicate detection within a single dataset using chunks: this method processes the dataset in smaller chunks to
+   optimize performance for large datasets.
+2. Duplicate detection between two datasets using chunks: this method allows for comparison between a reference dataset
+   and a new dataset to identify any overlapping duplicate images.
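+
+To give a concrete picture of the hashing idea (this is only an illustrative sketch, not UFOID's actual code), the
+snippet below uses the third-party `Pillow` and `ImageHash` packages to compare two images by the Hamming distance of
+their perceptual hashes; the file paths and threshold are placeholders.
+
+```python
+# Illustrative sketch only: perceptual hashing with a distance threshold.
+# Requires `pip install Pillow ImageHash`; the image paths are placeholders.
+from PIL import Image
+import imagehash
+
+DISTANCE_THRESHOLD = 10  # plays the same role as `distance_threshold` in the configuration
+
+hash_a = imagehash.phash(Image.open("image_a.jpg"))
+hash_b = imagehash.phash(Image.open("image_b.jpg"))
+
+# Subtracting two ImageHash objects returns their Hamming distance in bits.
+distance = hash_a - hash_b
+if distance <= DISTANCE_THRESHOLD:
+    print(f"Duplicates (distance={distance})")
+else:
+    print(f"Not duplicates (distance={distance})")
+```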
+
+## Installation
+
+To guarantee proper operation of the application, the recommended Python version is 3.9.2.
+
+### Clone UFOID
+
+```shell
+git clone https://github.com/immobiliare/ufoid
+cd ufoid
+```
+
+### Create virtualenv and install requirements
+
+To create a clean environment for the application, create a new virtualenv inside the current folder with
+
+```console
+python3 -m venv venv
+```
+
+A new folder named `venv` will be created in the current directory.
+
+To activate the virtualenv, execute
+
+```console
+source venv/bin/activate
+```
+
+and install the Python requirements with
+
+```console
+pip install -r requirements.txt
+```
+
+If you want to run the benchmarks and/or the tests, also execute
+
+```console
+pip install -r requirements-benchmark.txt
+pip install -r requirements-test.txt
+```
+
+Alternatively, you can use the Makefile by running the following command from the project root:
+
+```console
+make
+```
+
+This operation will:
+
+- create the venv;
+- update pip to the latest version;
+- install the requirements;
+- install the git hook.
+
+## Configuration
+
+Copying `ufoid/config/config.yaml.example` and renaming it to `config.yaml` allows you to customize various aspects of
+the duplicate detection process.
+Here are some key parameters you can modify (an illustrative example follows the list):
+
+- `num_processes`: Number of processes for parallel execution.
+- `chunk_length`: The length of each chunk for chunk-based processing. See below for more information.
+- `new_paths`: List of directory paths containing the new dataset for duplicate detection.
+- `old_paths`: List of directory paths containing the old dataset for comparison with the new dataset.
+- `check_with_itself`: Boolean flag to indicate whether to check for duplicates within the new dataset.
+- `check_with_old_data`: Boolean flag to indicate whether to check for duplicates between the new and old datasets.
+- `csv_output`: Boolean flag to indicate whether to save duplicate information to the output file.
+- `csv_output_file`: Path to the output file where duplicate information will be saved.
+- `delete_duplicates`: Boolean flag to indicate whether to delete duplicate images from the dataset.
+- `create_folder_with_no_duplicates`: Boolean flag to indicate whether to create a folder with non-duplicate images.
+- `new_folder`: Path to the folder where non-duplicate images will be stored.
+- `distance_threshold`: The distance threshold for considering images as duplicates. 10 is optimal for our use case,
+  since it allows us to catch all exact duplicates (with some resilience to minor image manipulations) while avoiding
+  collisions.
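+
+For reference, a minimal `config.yaml` could look like the sketch below; the values (and especially the paths) are
+purely illustrative and should be adapted to your dataset and hardware.
+
+```yaml
+# Illustrative values only; adjust them to your dataset and hardware.
+num_processes: 4
+chunk_length: 20000
+new_paths:
+  - /data/images/new
+old_paths:
+  - /data/images/old
+check_with_itself: true
+check_with_old_data: false
+csv_output: true
+csv_output_file: duplicates.csv
+delete_duplicates: false
+create_folder_with_no_duplicates: false
+new_folder: /data/images/without_duplicates
+distance_threshold: 10
+```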
+
+#### Chunk length
+
+`chunk_length` is an important parameter used in chunk-based processing to divide a large number of images into smaller,
+manageable chunks during the duplicate detection process. The size of each chunk is crucial as it directly affects
+memory usage and processing efficiency. By breaking down the dataset into chunks, we can prevent running out of memory
+and optimize the performance of the duplicate detection algorithm.
+
+The optimal value of `chunk_length` can vary depending on the hardware specifications of the machine running the
+process. For instance, on a machine with limited memory, such as an Apple M1 with 16 GB of RAM, a smaller `chunk_length`,
+like 20,000, has been found to work well. However, on a machine with more memory, a larger `chunk_length` might be
+suitable.
+
+It is essential to find a balance between the size of the chunks and the available system resources. If
+the `chunk_length` is too small, it may lead to a large number of iterations and increased overhead. On the other hand,
+if it is too large, it may cause the process to exhaust available memory, leading to performance issues or even crashes.
+
+To determine the appropriate `chunk_length` for your specific system, it is recommended to experiment with different
+values and observe the memory usage and performance of the duplicate detection process. This way, you can fine-tune the
+parameter to best suit your hardware configuration and dataset size.
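+
+As a simplified sketch of what chunk-based processing looks like (again, illustrative only and not UFOID's actual
+implementation), pairwise comparisons can be organized chunk by chunk so that each step only works on two chunks of
+hashes instead of the entire dataset at once:
+
+```python
+# Illustrative sketch of chunk-based pairwise comparison; not UFOID's actual code.
+# `hashes` maps an image path to its perceptual hash (e.g. an `imagehash.ImageHash`,
+# for which `a - b` returns the Hamming distance).
+from itertools import combinations
+
+
+def split_into_chunks(items, chunk_length):
+    """Split a list into consecutive slices of at most `chunk_length` items."""
+    return [items[i:i + chunk_length] for i in range(0, len(items), chunk_length)]
+
+
+def find_duplicates(hashes, chunk_length, threshold):
+    """Compare hashes chunk by chunk: within each chunk, then against every later chunk."""
+    duplicates = []
+    chunked = split_into_chunks(list(hashes.items()), chunk_length)
+    for index, chunk in enumerate(chunked):
+        # Pairs inside the current chunk.
+        for (path_a, hash_a), (path_b, hash_b) in combinations(chunk, 2):
+            if hash_a - hash_b <= threshold:
+                duplicates.append((path_a, path_b))
+        # Pairs between the current chunk and every later chunk.
+        for other in chunked[index + 1:]:
+            for path_a, hash_a in chunk:
+                for path_b, hash_b in other:
+                    if hash_a - hash_b <= threshold:
+                        duplicates.append((path_a, path_b))
+    return duplicates
+```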
+
+#### Distance threshold
+
+The `distance_threshold` is a fundamental parameter that controls the level of sensitivity in duplicate image detection.
+By setting the distance threshold to 0, the algorithm will only identify exact duplicates, i.e. images that are
+pixel-by-pixel identical. As the threshold is increased, the detection becomes more permissive, allowing for the
+identification of near-duplicates with minor alterations such as changes in brightness, compression artifacts, or minor
+edits.
+
+For instance, when using a distance threshold of 10, the algorithm can identify quasi-duplicates, even if the images
+have undergone slight modifications, making it suitable for scenarios where minor image manipulations may occur. Raising
+the threshold to 30 enables the detection of images with more substantial changes, providing increased flexibility while
+still avoiding unnecessary collisions with unrelated images.
+
+However, as the threshold reaches higher values, like 60, it becomes more inclusive and may lead to the detection of
+heavily manipulated images, potentially causing collisions with very similar images that are not actual duplicates. For
+example, two images may share a common scene, but one of them might have a significant addition, like a large printed
+word, which can trigger false positives.
+
+It's essential to choose an appropriate distance threshold based on the characteristics of the dataset and the specific
+use case. A balance must be struck between catching meaningful duplicates and minimizing the chances of identifying
+false positives or unrelated images as duplicates. Experimenting with different threshold values and observing the
+results will help determine the most suitable value for a particular scenario.
+
+## Results
+
+Start the script using the following command:
+
+```console
+python -m ufoid
+```
+
+After running the script, the duplicate detection process will complete, and the output will be displayed in
+the console. The detected duplicate image pairs will be saved to the specified output file (if `csv_output` is set
+to `true`).
+
+Additionally, based on the configuration, a folder with non-duplicate images may be created, and duplicate images may be
+deleted from the dataset.
+
+## Changelog
+
+See [changelog](./CHANGELOG.md).
+
+## Contributing
+
+We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further
+discussion.
+
+If you plan to contribute new features, utility functions, or extensions to the core, please first open an issue and
+discuss the feature with us.
+Sending a PR without discussion might end up resulting in a rejected PR because we might be taking the core in a
+different direction than you might be aware of.
+
+To learn more about making a contribution, please see our [Contribution page](./CONTRIBUTING.md).
+
+## Powered apps
+
+UFOID was created by ImmobiliareLabs, the technology department of [Immobiliare.it](https://www.immobiliare.it),
+the #1 real estate company in Italy.
+
+**If you are using UFOID, [drop us a message](mailto:opensource@immobiliare.it)**.
+
+## Support
+
+Made with ❤️ by [ImmobiliareLabs](https://github.com/immobiliare) and all the
+[contributors](./CONTRIBUTING.md#contributors).
+
+If you have any questions about how to use UFOID, or want to report bugs or request enhancements, please feel free to
+reach out by opening a [GitHub Issue](https://github.com/immobiliare/ufoid/issues).