This project explores how prompting LLMs such as ChatGPT can improve the precision of product image extraction.
This is an ongoing research project, so the code may not be very clean.
The goal of this project is to extract product images from e-commerce product pages while excluding irrelevant images such as logos or similar product images. This requires handling various languages and filtering based on text content.
-
Initial Setup
- Display HTML cleanly (show only tag names, text, and image links)
- Extract all headers
- Remove irrelevant tags (e.g., script, header, footer)
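
The setup steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes BeautifulSoup (`bs4`) with the built-in `html.parser`, reads "headers" as the heading tags `h1` to `h6`, and the function names `strip_irrelevant`, `extract_headers`, and `render_outline` are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Tags named in the notes as irrelevant; extend as needed.
IRRELEVANT_TAGS = ["script", "header", "footer"]
# Assumption: "headers" means the heading tags h1..h6.
HEADING_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def strip_irrelevant(soup):
    """Detach every irrelevant tag (and its subtree) from the document."""
    for tag in soup.find_all(IRRELEVANT_TAGS):
        tag.extract()
    return soup


def extract_headers(soup):
    """Return the text of every heading, in document order."""
    return [h.get_text(" ", strip=True) for h in soup.find_all(HEADING_TAGS)]


def render_outline(node, depth=0):
    """Return indented lines showing only tag names, text, and image links."""
    lines = []
    indent = "  " * depth
    for child in node.children:
        if isinstance(child, Comment):
            continue  # HTML comments carry no useful signal here
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:
                lines.append(indent + text)
        elif isinstance(child, Tag):
            label = child.name
            if child.name == "img":
                label += " " + (child.get("src") or "")
            lines.append(indent + label)
            lines.extend(render_outline(child, depth + 1))
    return lines
```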
-
HTML Cleaning
- Retain all images
- Retain all headers
- Delete elements without images or headers
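
One possible reading of the cleaning rules above, again assuming BeautifulSoup and treating "headers" as `h1` to `h6`: an element is deleted when it neither is nor contains an image or heading. The name `prune` is hypothetical, and markup nested inside a heading is deliberately left alone so the heading text survives.

```python
from bs4 import BeautifulSoup

# Elements that must survive cleaning (assumption: headers are h1..h6).
KEEP_TAGS = ["img", "h1", "h2", "h3", "h4", "h5", "h6"]


def prune(soup):
    """Remove elements that neither are nor contain an image or heading."""
    for tag in soup.find_all(True):
        if tag.name in KEEP_TAGS:
            continue
        # Inline markup inside a kept heading (e.g. <span>) stays in place.
        if tag.find_parent(KEEP_TAGS) is not None:
            continue
        if tag.find(KEEP_TAGS) is None:
            tag.extract()
    return soup
```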
-
Identifying Product Images
- OpenAI-based Approach
- Use OpenAI to identify the product image class:
- First experiment: compress the tree to minimize tokens, recursively feed the API with cleaned text, and return image links as an array
- Restructure the code as a pipeline: website link, then BeautifulSoup processing, then prompt generation, then link extraction
- Retain only crucial information (image links, header texts, class names)
- Remove duplicates (drop an entry if its class and text/link are similar to the previous entry)
- Avoid batching, keep prompts short
- Fix site-specific issues (e.g., Jumia stores the image source in the data-src attribute)
- Test with other e-commerce sites
- Reject small images (<70 pixels)
- Use title and surrounding text in the prompt
- Selenium-based Approach
- Identify the largest image, its class, and adjacent images (consider CSS)
- Integrate findings into the prompt
- Use Crawlbase for proper website loading
- Reject images that do not share the same parent div
- Compare OpenAI and non-OpenAI algorithms for precision
- Observations:
- Direct inspection of cleaned prompts may suffice without ChatGPT
- Simplified prompts might directly yield images
- Extra Step: Implement image classification to:
- Differentiate between logos and product images
- Determine if product images belong to similar or complementary products
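
Several of the filtering ideas listed in this section (the data-src fallback, the 70-pixel size cutoff, dropping consecutive duplicates, and keeping the prompt down to class names, header texts, and image links) can be sketched in plain Python. Everything named here is hypothetical: the `Candidate` record, the helper functions, and the prompt wording are illustrative assumptions, exact equality stands in for "similar", and the size check assumes the cutoff applies to whichever side length is known. Sending the prompt to the OpenAI API is left out.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SIDE_PX = 70  # from the notes: reject images smaller than 70 pixels


@dataclass
class Candidate:
    kind: str                     # "img" or "header"
    css_class: str                # class name of the element
    value: str                    # image link or header text
    width: Optional[int] = None   # known pixel width, if any
    height: Optional[int] = None  # known pixel height, if any


def image_link(attrs):
    """Prefer src; fall back to data-src for lazy-loaded images."""
    return attrs.get("src") or attrs.get("data-src")


def too_small(c):
    """True for an image with any known side below the cutoff."""
    sides = [s for s in (c.width, c.height) if s is not None]
    return c.kind == "img" and any(s < MIN_SIDE_PX for s in sides)


def filter_candidates(candidates):
    """Drop small images and entries identical to the previously kept one."""
    kept = []
    for c in candidates:
        if too_small(c):
            continue
        if kept and (kept[-1].css_class, kept[-1].value) == (c.css_class, c.value):
            continue
        kept.append(c)
    return kept


def build_prompt(title, candidates):
    """Build one short prompt from the page title and the kept entries."""
    lines = [f"Page title: {title}", "Entries (kind | class | text or link):"]
    lines += [f"{c.kind} | {c.css_class} | {c.value}" for c in candidates]
    lines.append("Return the links of the main product's images as a JSON array.")
    return "\n".join(lines)
```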
-
Final Steps
- Verify images within identified elements
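
A sketch of one way to read this verification step, assuming the earlier stage has produced a class name for the product images: collect the image links that actually occur inside elements carrying that class, so the result can be checked against the page. The function name `images_in_class` and the data-src fallback are assumptions, and BeautifulSoup is assumed as the parser.

```python
from bs4 import BeautifulSoup


def images_in_class(html, css_class):
    """Return unique image links found in elements with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for element in soup.find_all(class_=css_class):
        # The class may sit on the <img> itself or on a wrapper element.
        imgs = [element] if element.name == "img" else element.find_all("img")
        for img in imgs:
            link = img.get("src") or img.get("data-src")
            if link and link not in links:
                links.append(link)
    return links
```

An empty result would suggest that the identified class is wrong or that the images are loaded some other way.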