One of the most important steps in data science and the development of machine learning models is data collection. High-quality data is the key ingredient for training any successful algorithm. Unfortunately, collecting that data is also one of the most complicated steps. Imagine you want to train a model that recognizes people wearing purple sunglasses, a red shirt, and yellow shoes. Where would you find a dataset of people meeting those criteria? It would not be an easy task, right?
In most cases, data scientists spend a significant amount of their time finding the right data, and this is typically very time-consuming. This is especially true in the world of video, where terabytes of footage are usually analyzed or labeled manually, and the correct dataset is rarely readily available. To make matters worse, machine learning models require up-to-date training data, which means that collecting data is not a one-time job; it is a continuous process that never stops. Developers of these models therefore need an automated and smart approach to creating those datasets. This is where data filtering becomes crucial to the concept of automated dataset harvesting.
Automated dataset harvesting involves using algorithms and tools to continuously collect, filter, and update datasets to ensure they meet the required criteria for training machine learning models. This approach not only saves time but also ensures that the data remains relevant and high-quality, which is essential for the success of any machine learning project.
The Data Harvesting Project focuses on automating the process of collecting and updating datasets for training machine learning models, particularly in the domain of video-based data. It aims to streamline the process of creating datasets by utilizing automated harvesting and data filtering techniques, enabling continuous updates and improving the quality and relevance of the data collected.
This project contains several key components responsible for different stages of the data harvesting process:
- `models/`
  - Contains all the pretrained models used for making predictions. Newly added models should be placed in this folder, as it is read automatically when the models are triggered.
  - The models should be YOLO-compatible, since they are accessed uniformly through `ultralytics/YOLO`.
- `projects/`
  - Contains the implementations of the different projects, built using the Factory Pattern. First time hearing about this term?
    - Short explanation: the underlying principle is to create objects without exposing the creation logic to the client, and instead refer to the newly created objects through a common interface. A newly added project can therefore be instantiated directly via `init()` (see the sketch after this list).
    - Long explanation: Learn more
  - Each project may have its own specific conditions/requirements for the models.
  - We added the `helmet` project as an example: in this project, both a Person and a Helmet need to appear in the frame, and the Person bounding boxes must satisfy the `min_height` and `min_width` parameters, among others. These project-specific conditions are read and handled via the abstracted `condition_func`, and the relevant parameters can be found in `project_config.yaml`.
- `exports/`
  - Handles the conversion of prediction results into an exportable format and also supports dataset versioning.
  - After the prediction for the entire video is made, the results need to be formatted before being uploaded. This folder contains the logic for converting results into the supported formats, which currently include:
    - `yolov8`: works for YOLOv8 (tested) and other Ultralytics YOLO models; more information about the directory structure here
    - `flat`: simply keeps both the image and its annotation in the same directory.
- `integrations/`
  - Manages cloud platform interactions, specifically uploading and downloading datasets.
  - After the predictions are processed and exported to the dedicated format, this module handles the upload of the final dataset. Currently supported:
    - `s3`: any S3-compatible platform, such as Wasabi. You need to provide the following variables in the `.env` file under `S3 parameters`: `S3_ENDPOINT`, `S3_ACCESS_KEY`, `S3_SECRET_KEY`, `S3_BUCKET`.
    - `roboflow`: a common platform for storing and managing deep learning datasets. Learn more. You need to provide the following variables in the `.env` file under `Roboflow parameters`: `RBF_API_KEY`, `RBF_WORKSPACE`, `RBF_PROJECT`.
  - For more information about the `.env` parameters, see .env parameters
- The core harvesting module
  - Contains the core logic for the harvesting service.
  - Manages tasks such as orchestrating the other components, reading keys from the message broker and from your vault, extracting frames from videos, evaluating video content, and deleting media after it is processed.
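To make the Factory Pattern mentioned above concrete, here is a minimal, self-contained sketch of the idea as it applies to `projects/`. The class and registry names below are illustrative only, not the repository's actual code; the real implementation lives in `projects/Project_Factory.py`.

```python
# Minimal Factory Pattern sketch (illustrative names, not the repo's actual code).

class BaseProject:
    """Common interface every project implements."""

    def condition_func(self, labels):
        raise NotImplementedError


class HelmetProject(BaseProject):
    """Example project: a frame qualifies only if it shows a person AND a helmet."""

    def condition_func(self, labels):
        return {"person", "helmet"} <= set(labels)


# Registry mapping project names (e.g. the PROJECT_NAME .env variable)
# to their classes. Adding a project only requires a new entry here.
_PROJECTS = {
    "helmet": HelmetProject,
}


def init(project_name: str) -> BaseProject:
    """Create a project without exposing the creation logic to the caller."""
    try:
        return _PROJECTS[project_name]()
    except KeyError:
        raise ValueError(f"Unknown project: {project_name!r}")


# Usage: the caller only ever sees the common interface.
project = init("helmet")
print(project.condition_func(["person", "helmet", "car"]))  # True
```

The payoff of this pattern is that the harvesting pipeline never needs to know which concrete project class it is running; it just calls `init()` with the configured project name.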
Besides the well-organized directory structure above, the following files are also worth a look:
- `single-shot.py`: a test version of the harvesting process, designed to work with a single video. It allows you to test and validate the extraction and processing pipeline on a smaller scale before moving to the scaled-up `queue_harvesting.py`.
- `queue_harvesting.py`: the main script of the project, which acts as the controller managing the workflow and the interactions between the different components. This script orchestrates the entire data harvesting process, ensuring seamless integration with cloud resources and data repositories.
- `condition.py`: handles the logic for the model predictions. It manages how the models are applied to the frames, controlling the conditions for predictions and processing (a sketch of this flow follows the list).
  - It performs object detection using the list of YOLO models defined in `project_config.yaml`. For each model, YOLO object tracking is used: the `MODEL.track()` method performs object detection on the frame, with tracking enabled to maintain unique IDs for detected objects across frames.
    - For more information and parameters, please check: YOLOv8 model.track
  - After processing, the results are merged, sorted by accuracy, and mapped to the labels of the first model.
  - Duplicate removal: to avoid duplicate boxes being detected for a single object, boxes that have the same label and similar coordinates as the highest-accuracy box are discarded.
  - Crop frame and transform annotations: to reduce storage waste when storing the dataset, frames are cropped to the ROI (Region of Interest) areas only, and the annotations are transformed accordingly to fit the cropped frames.
- `.env`: this file contains environment-specific variables used to configure the scripts without hard-coding sensitive information. Typical variables include API keys, database URLs, and credentials needed to access cloud services. Ensure that this file is properly configured before running the scripts, and keep it secure to prevent unauthorized access.
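The following sketch illustrates the flow described for `condition.py`: running several YOLO models with tracking, merging their results by confidence, and removing near-duplicate boxes. It assumes `ultralytics` is installed; the model paths, the `iou` helper, and the thresholds are placeholders for illustration, not the repository's actual code.

```python
# Illustrative sketch of the condition.py flow, not the repository's actual code.
from ultralytics import YOLO

MODELS = [YOLO("models/helmet_detector_1k_16b_150e.pt"), YOLO("models/yolov8x.pt")]
ALLOWED_CLASSES = [[0, 1, 2], [0]]  # per-model classes, as in project_config.yaml


def iou(a, b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def detect(frame, iou_threshold=0.85):
    # Run every model with tracking enabled, so detected objects keep
    # unique IDs across frames.
    detections = []
    for model, classes in zip(MODELS, ALLOWED_CLASSES):
        result = model.track(frame, classes=classes, persist=True)[0]
        for box in result.boxes:
            detections.append((float(box.conf), int(box.cls), box.xyxy[0].tolist()))

    # Merge the results of all models, highest confidence first.
    detections.sort(key=lambda d: d[0], reverse=True)

    # Duplicate removal: drop boxes with the same label and near-identical
    # coordinates as an already-kept, higher-confidence box.
    kept = []
    for conf, cls, xyxy in detections:
        if all(cls != k[1] or iou(xyxy, k[2]) < iou_threshold for k in kept):
            kept.append((conf, cls, xyxy))
    return kept
```

Sorting before deduplicating guarantees that, of any cluster of overlapping same-label boxes, the highest-confidence one is the one that survives.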
To quickly get started with the data-harvesting repo and set up a new project, follow these steps:
1. Add pretrained models:
   - Place your pretrained model(s) in the `models/` folder.
   - These models will be used for making predictions during the data processing pipeline.
2. Configure `.env`:
   - Open the `.env` file in the root of the repository.
   - Add a new entry for your project by setting the `PROJECT_NAME` variable to the name of your project folder, for example: `PROJECT_NAME=your_project_name`
3. Create a new project:
   - In the `projects/` folder, create a new folder for your project.
   - Inside this folder, define the project-specific logic.
4. Register the project with `Project_Factory`:
   - In the `projects/` folder, locate and update the `Project_Factory.py` file.
   - Add your newly created project by importing it and registering it in the factory under the `your_project_name` value you set in step 2. This ensures that your project can be initialized dynamically when selected, without affecting other parts of the codebase.
5. Add project configuration:
   - Create a `project_config.yaml` file in your project folder. It must contain these compulsory keys:

     ```yaml
     # Example project_config.yaml
     models:
       - helmet_detector_1k_16b_150e.pt
       - yolov8x.pt
     allowed_classes:
       - [0, 1, 2]
       - [0]
     ```

   - Here `models` is the list of models to use (at least one), and `allowed_classes` lists, per model, the classes/labels you want that model to predict.
   - Besides these, you can add optional parameters below them; they can then be easily accessed in the project.
6. Implement project logic:
   - Inside your new project folder, reference the base class provided by the repository:

     ```python
     class Your_Project(BaseProject, IYour_Project):
         def condition_func(self):
             # This function should be implemented with your own condition.
             ...
     ```

   - Implement the following:
     - `condition_func`: define how your project will handle model predictions and conditions.
Once you've completed these steps, your new project will be ready for integration with the data-harvesting pipeline! A hypothetical skeleton combining steps 5 and 6 is sketched below.
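The sketch below ties steps 5 and 6 together in one file. The stand-in base classes, the `condition_func` signature, the config path, and the box format are all assumptions for illustration; use the repository's actual base classes and interfaces instead.

```python
# Hypothetical skeleton for projects/your_project_name/ (steps 5 and 6).
import yaml


class BaseProject:      # stand-in for the repository's base class
    pass


class IYour_Project:    # stand-in for the repository's project interface
    pass


class Your_Project(BaseProject, IYour_Project):
    def __init__(self, config_path="projects/your_project_name/project_config.yaml"):
        # Load the compulsory keys (models, allowed_classes) plus any
        # optional parameters added in project_config.yaml.
        with open(config_path) as f:
            self.config = yaml.safe_load(f)

    def condition_func(self, boxes):
        # Example condition: keep a frame only if at least one detection
        # satisfies the optional min_width/min_height config parameters.
        min_w = self.config.get("min_width", 0)
        min_h = self.config.get("min_height", 0)
        return any(
            (x2 - x1) >= min_w and (y2 - y1) >= min_h
            for x1, y1, x2, y2 in boxes  # assumes [x1, y1, x2, y2] boxes
        )
```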
Below is a list of the environment variables that need to be provided. Each parameter is described along with its purpose.
| Name | Value | Explanation |
|---|---|---|
| `PROJECT_NAME` | `your_project_name` | The name of the project; this name should also be registered in `Project_Factory.py`. |
| `DATASET_FORMAT` | `yolov8`, `flat` | The export format of the dataset. Must be one of the listed values; more at `exports/`. |
| `DATASET_VERSION` | `0.0.0`, `0.0.1`, ... | The version of the dataset, used for version control; customize it to your own needs. |
| `DATASET_UPLOAD` | `False`, `True` | Specifies whether the dataset should be uploaded via `integrations/`. |
| `FORWARDING_MEDIA` | `True`, `False` | Indicates whether media should be forwarded. |
| `REMOVE_AFTER_PROCESSED` | `True`, `False` | Specifies whether to remove media files after they have been processed. |
| `QUEUE_NAME` | `your_queue` | The name of the queue used for managing tasks. |
| `TARGET_QUEUE_NAME` | | The name of the target queue, if applicable. |
| `QUEUE_EXCHANGE` | | The exchange used for the message broker. |
| `QUEUE_HOST` | `your_queue_host`, e.g. `rabbitmq-broker.xxx.xx:5672` | The host and port of the RabbitMQ broker. |
| `QUEUE_USERNAME` | `your_queue_username` | The username for accessing the message broker. |
| `QUEUE_PASSWORD` | `your_queue_password` | The password for accessing the message broker. |
| `STORAGE_URI` | `your_uri`, e.g. `https://vault.xxx.xx/api` | The URI for accessing the Kerberos Vault. More information at Kerberos Vault. |
| `STORAGE_ACCESS_KEY` | `your_storage_access_key` | The access key for the Kerberos Vault. More information at Kerberos Vault. |
| `STORAGE_SECRET_KEY` | `your_storage_secret_key` | The secret key for the Kerberos Vault. More information at Kerberos Vault. |
| `INTEGRATION_NAME` | `s3`, `roboflow` | The name of the integration platform. Must be one of the listed values; more at `integrations/`. |
| `RBF_API_KEY` | `your_roboflow_key` | The API key for accessing Roboflow. Provide if `INTEGRATION_NAME=roboflow`; otherwise leave empty. |
| `RBF_WORKSPACE` | `your_roboflow_workspace` | The workspace name in Roboflow. Provide if `INTEGRATION_NAME=roboflow`; otherwise leave empty. |
| `RBF_PROJECT` | `your_roboflow_project` | The project name in Roboflow. Provide if `INTEGRATION_NAME=roboflow`; otherwise leave empty. |
| `S3_ENDPOINT` | `your_s3_endpoint` | The endpoint of the S3-compatible storage service. Provide if `INTEGRATION_NAME=s3`; otherwise leave empty. |
| `S3_ACCESS_KEY` | `your_s3_access_key` | The access key for the S3-compatible storage service. Provide if `INTEGRATION_NAME=s3`; otherwise leave empty. |
| `S3_SECRET_KEY` | `your_s3_secret_key` | The secret key for the S3-compatible storage service. Provide if `INTEGRATION_NAME=s3`; otherwise leave empty. |
| `S3_BUCKET` | `your_s3_bucket` | The name of the bucket in the S3-compatible storage service. Provide if `INTEGRATION_NAME=s3`; otherwise leave empty. |
| `TIME_VERBOSE` | `True`, `False` | Indicates whether detailed timing information should be logged. |
| `LOGGING` | `True`, `False` | Specifies whether logging is enabled. |
| `CLASSIFICATION_FPS` | default `5` | The number of frames per second sampled for classification. |
| `CLASSIFICATION_THRESHOLD` | default `0.2` | The confidence threshold for classification predictions. |
| `MAX_NUMBER_OF_PREDICTIONS` | default `100` | The maximum number of predictions to be made. |
| `FRAMES_SKIP_AFTER_DETECT` | default `50` | The number of frames to skip after a detection is made. |
| `MIN_DETECTIONS` | default `1` | The minimum number of detections required to consider an object recognized. |
| `IOU` | default `0.85` | The Intersection over Union (IoU) threshold used by `model.track()`. More information at iou. |
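Putting the table together, a filled-in `.env` might look like the sketch below. All values are placeholders taken from the table above, here with the `s3` integration selected (so the Roboflow variables are left empty); adjust everything to your own environment.

```
# Project
PROJECT_NAME=your_project_name
DATASET_FORMAT=yolov8
DATASET_VERSION=0.0.1
DATASET_UPLOAD=True
FORWARDING_MEDIA=False
REMOVE_AFTER_PROCESSED=True

# Queue parameters (RabbitMQ)
QUEUE_NAME=your_queue
TARGET_QUEUE_NAME=
QUEUE_EXCHANGE=
QUEUE_HOST=rabbitmq-broker.xxx.xx:5672
QUEUE_USERNAME=your_queue_username
QUEUE_PASSWORD=your_queue_password

# Kerberos Vault parameters
STORAGE_URI=https://vault.xxx.xx/api
STORAGE_ACCESS_KEY=your_storage_access_key
STORAGE_SECRET_KEY=your_storage_secret_key

# Integration (s3 selected; Roboflow parameters left empty)
INTEGRATION_NAME=s3

# S3 parameters
S3_ENDPOINT=your_s3_endpoint
S3_ACCESS_KEY=your_s3_access_key
S3_SECRET_KEY=your_s3_secret_key
S3_BUCKET=your_s3_bucket

# Roboflow parameters
RBF_API_KEY=
RBF_WORKSPACE=
RBF_PROJECT=

# Runtime tuning
TIME_VERBOSE=True
LOGGING=True
CLASSIFICATION_FPS=5
CLASSIFICATION_THRESHOLD=0.2
MAX_NUMBER_OF_PREDICTIONS=100
FRAMES_SKIP_AFTER_DETECT=50
MIN_DETECTIONS=1
IOU=0.85
```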