<div align="center">

![ufoid](logo.png)

</div>

# UFOID

> Ultra Fast Optimized Image Deduplication.

![Python](https://img.shields.io/badge/Python-3.9-blue)
![GitHub release (latest SemVer)](https://img.shields.io/github/v/release/immobiliare/ufoid)

## Table of Contents

- [Introduction](#introduction)
- [Installation](#installation)
- [Configuration](#configuration)
- [Results](#results)
- [Changelog](#changelog)
- [Contributing](#contributing)
- [Powered apps](#powered-apps)
- [Support](#support)
- [License](#license)

## Introduction

The goal of this project is to efficiently detect and handle duplicate images within a dataset and across different
datasets. The code uses perceptual hashing (image hashing) to convert images into hash representations, allowing for
quick comparison and identification of duplicate images based on a specified distance threshold.

The project provides two main functionalities:

1. Duplicate detection within a single dataset using chunks: this method processes the dataset in smaller chunks to
   optimize performance for large datasets.
2. Duplicate detection between two datasets using chunks: this method allows for comparison between a reference dataset
   and a new dataset to identify any overlapping duplicate images.
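
To make the mechanism concrete, here is a minimal sketch of perceptual-hash comparison. It uses the `imagehash` and
`Pillow` libraries as illustrative stand-ins (ufoid's internal hashing code may differ), and the file names are
placeholders:

```python
# Minimal sketch of perceptual-hash comparison (illustrative, not ufoid's code).
from PIL import Image
import imagehash

THRESHOLD = 10  # Hamming-distance cutoff; see "Distance threshold" below

# Hash two images; perceptually similar images produce nearby hashes.
hash_a = imagehash.phash(Image.open("image_a.jpg"))
hash_b = imagehash.phash(Image.open("image_b.jpg"))

# Subtracting two ImageHash objects yields their Hamming distance.
if hash_a - hash_b <= THRESHOLD:
    print("duplicate (or near-duplicate) pair")
```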

## Installation

To guarantee proper operation of the application, the recommended Python version is 3.9.2.

### Clone UFOID

```shell
git clone https://github.com/immobiliare/ufoid
cd ufoid
```

### Create virtualenv and install requirements

To create a clean environment for the application, create a new virtualenv inside the current folder:

```console
python3 -m venv venv
```

This creates a new folder named `venv` in the current directory.

Activate the virtualenv:

```console
source venv/bin/activate
```

and install the Python requirements:

```console
pip install -r requirements.txt
```

If you want to run the benchmarks and/or the tests, also execute:

```console
pip install -r requirements-benchmark.txt
pip install -r requirements-test.txt
```

Alternatively, use the Makefile by running the following command from the project root:

```console
make
```

This operation will:

- create the venv;
- update pip to the latest version;
- install the requirements;
- install the git hook.

## Configuration

Copy `ufoid/config/config.yaml.example` and rename it to `config.yaml` to customize various aspects of the
duplicate detection process.
Here are some key parameters you can modify (an example file follows the list):

- `num_processes`: Number of processes for parallel execution.
- `chunk_length`: The length of each chunk for chunk-based processing. See below for more information.
- `new_paths`: List of directory paths containing the new dataset for duplicate detection.
- `old_paths`: List of directory paths containing the old dataset for comparison with the new dataset.
- `check_with_itself`: Boolean flag to indicate whether to check for duplicates within the new dataset.
- `check_with_old_data`: Boolean flag to indicate whether to check for duplicates between the new and old datasets.
- `csv_output`: Boolean flag to indicate whether to save duplicate information to the output file.
- `csv_output_file`: Path to the output file where duplicate information will be saved.
- `delete_duplicates`: Boolean flag to indicate whether to delete duplicate images from the dataset.
- `create_folder_with_no_duplicates`: Boolean flag to indicate whether to create a folder with non-duplicate images.
- `new_folder`: Path to the folder where non-duplicate images will be stored.
- `distance_threshold`: The distance threshold for considering images as duplicates. A value of 10 is optimal for our
  use case, since it catches all exact duplicates (with some resilience to minor image manipulations) while avoiding
  collisions.
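
For reference, a filled-in `config.yaml` might look like the following sketch; the values and paths are illustrative
examples, not project defaults:

```yaml
# Illustrative example of config.yaml; paths and values are placeholders.
num_processes: 8
chunk_length: 20000
new_paths:
  - /data/images/new
old_paths:
  - /data/images/reference
check_with_itself: true
check_with_old_data: true
csv_output: true
csv_output_file: duplicates.csv
delete_duplicates: false
create_folder_with_no_duplicates: false
new_folder: /data/images/no_duplicates
distance_threshold: 10
```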

### Chunk length

`chunk_length` is an important parameter used in chunk-based processing to divide a large number of images into smaller,
manageable chunks during the duplicate detection process. The size of each chunk is crucial, as it directly affects
memory usage and processing efficiency. By breaking the dataset into chunks, we can prevent running out of memory
and optimize the performance of the duplicate detection algorithm.

The optimal value of `chunk_length` can vary depending on the hardware specifications of the machine running the
process. For instance, on a machine with limited memory, such as an Apple M1 with 16 GB of RAM, a smaller
`chunk_length`, like 20,000, has been found to work well. However, on a machine with more memory, a larger
`chunk_length` might be suitable.

It is essential to find a balance between the size of the chunks and the available system resources. If
the `chunk_length` is too small, it may lead to a large number of iterations and increased overhead. On the other hand,
if it is too large, it may cause the process to exhaust available memory, leading to performance issues or even crashes.

To determine the appropriate `chunk_length` for your specific system, it is recommended to experiment with different
values and observe the memory usage and performance of the duplicate detection process. This way, you can fine-tune the
parameter to best suit your hardware configuration and dataset size.
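
As an illustration of the idea, here is a simplified, single-process sketch (not ufoid's actual implementation, which
parallelizes the work across `num_processes`): chunking bounds memory because only two chunks of hashes need to be held
and compared at a time.

```python
# Simplified sketch of chunk-based duplicate detection (illustrative only).
from itertools import combinations, product

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def find_duplicates(hashes, chunk_length, threshold):
    """`hashes` is a list of (path, image_hash) pairs; returns duplicate pairs."""
    duplicates = []
    chunks = chunked(hashes, chunk_length)
    for i, chunk_a in enumerate(chunks):
        for j in range(i, len(chunks)):
            # Within a single chunk, compare each unordered pair once;
            # across two chunks, compare every element of one with the other.
            pairs = combinations(chunk_a, 2) if i == j else product(chunk_a, chunks[j])
            for (path_a, hash_a), (path_b, hash_b) in pairs:
                if hash_a - hash_b <= threshold:
                    duplicates.append((path_a, path_b))
    return duplicates
```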

### Distance threshold

The `distance_threshold` is a fundamental parameter that controls the level of sensitivity in duplicate image detection.
By setting the distance threshold to 0, the algorithm will only identify exact duplicates, i.e. images whose hashes are
identical. As the threshold is increased, the detection becomes more permissive, allowing for the identification of
near-duplicates with minor alterations such as changes in brightness, compression artifacts, or minor edits.

For instance, when using a distance threshold of 10, the algorithm can identify quasi-duplicates, even if the images
have undergone slight modifications, making it suitable for scenarios where minor image manipulations may occur. Raising
the threshold to 30 enables the detection of images with more substantial changes, providing increased flexibility while
still avoiding unnecessary collisions with unrelated images.

However, at higher values, like 60, the threshold becomes more inclusive and may lead to the detection of heavily
manipulated images, potentially causing collisions with very similar images that are not actual duplicates. For
example, two images may share a common scene, but one of them might have a significant addition, like a large printed
word, which can trigger false positives.

It is essential to choose an appropriate distance threshold based on the characteristics of the dataset and the specific
use case. A balance must be struck between catching meaningful duplicates and minimizing the chances of identifying
false positives or unrelated images as duplicates. Experimenting with different threshold values and observing the
results will help determine the most suitable value for a particular scenario.
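
The following sketch (again using the `imagehash` library as a hypothetical stand-in for ufoid's hashing, with a
placeholder file name) shows how the same image pair can fall on either side of different thresholds:

```python
# Hypothetical demo: how the threshold changes the duplicate verdict.
from PIL import Image, ImageEnhance
import imagehash

original = Image.open("photo.jpg")                         # placeholder path
brighter = ImageEnhance.Brightness(original).enhance(1.2)  # minor manipulation

# Hamming distance between the two perceptual hashes.
distance = imagehash.phash(original) - imagehash.phash(brighter)

for threshold in (0, 10, 30, 60):
    verdict = "duplicate" if distance <= threshold else "distinct"
    print(f"threshold {threshold:2d}: {verdict}")
```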

## Results

Start the script using the following command:

```console
python -m ufoid
```

After running the script, the duplicate detection process will complete, and the output will be displayed in
the console. The detected duplicate image pairs will be saved to the specified output file (if `csv_output` is set
to `true`).

Additionally, based on the configuration, a folder with non-duplicate images may be created, and duplicate images may be
deleted from the dataset.

## Changelog

See the [changelog](./CHANGELOG.md).

## Contributing

We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further
discussion.

If you plan to contribute new features, utility functions, or extensions to the core, please first open an issue and
discuss the feature with us.
Sending a PR without discussion might result in a rejected PR, because we might be taking the core in a different
direction than you are aware of.

To learn more about making a contribution, please see our [Contribution page](./CONTRIBUTING.md).

## Powered apps

UFOID was created by ImmobiliareLabs, the technology department of [Immobiliare.it](https://www.immobiliare.it),
the #1 real estate company in Italy.

**If you are using UFOID, [drop us a message](mailto:opensource@immobiliare.it)**.

## Support

Made with ❤️ by [ImmobiliareLabs](https://github.com/immobiliare) and all the
[contributors](./CONTRIBUTING.md#contributors).

If you have any questions about how to use UFOID, bug reports, or enhancement requests, feel free to reach out to us by
opening a [GitHub Issue](https://github.com/immobiliare/ufoid/issues).