The task involved web scraping the OLX website to collect information about bike advertisements. The primary objective was to distinguish between genuine and fake advertisements, aiding users in identifying potential fraud. The process included the use of the DBSCAN clustering algorithm and Streamlit for building a user interface.
- Web Scraping OLX: The process began by scraping OLX, a classifieds platform, to gather data related to bike advertisements. This data would include various attributes like price, description, location, and seller information.
- Fake AD Detection: To differentiate between genuine and fake ads, a classification or clustering approach was employed. DBSCAN, a density-based clustering algorithm, was chosen to group similar ads together and potentially identify fake ads as outliers.
- DBSCAN Clustering: DBSCAN was applied to the collected data, aiming to group similar ads into clusters based on their attributes. Ads that significantly deviated from the clusters could be considered potential fake ads or outliers.
- Streamlit User Interface: A user-friendly interface was created using Streamlit. Users could input the URL of a bike advertisement they wanted to check for authenticity.
- URL Analysis: The entered URL was processed to extract information about the specific bike ad.
- Prediction: Based on the clustering results, the system provided a prediction on whether the entered ad was genuine or potentially fake. Fake ads were likely treated as outliers by the DBSCAN algorithm.
- Visual Representation: A scatter plot visual was generated to display the clustering results. Genuine ads typically formed clusters, while fake ads might appear as isolated points (outliers) on the plot.
- Data Augmentation: The information from the analyzed URL, along with the authenticity prediction, was appended to the dataset. This process helped improve the model's accuracy over time as more data became available.
- User Feedback: User interactions with the Streamlit interface and their feedback may have been used to further refine the model's accuracy and the user experience.
In essence, this task combined web scraping, clustering using DBSCAN, and the development of a Streamlit interface to assist users in identifying potentially fraudulent bike advertisements on OLX. It provided a user-friendly way to evaluate ad authenticity and contributed to the dataset's enrichment for ongoing improvement in fraud detection.