Extracting Images from Ecommerce Sites

This project explores how LLMs (ChatGPT, via the OpenAI API) and careful prompt design can make product image extraction more precise.

Note

This is an ongoing research project, so the code may not be very clean.

Objective

The goal of this project is to extract a product's own images from e-commerce product pages while excluding irrelevant ones, such as logos or images of similar products. Doing so requires handling pages in various languages and filtering on text content.

Planned Steps

Initial Setup
- Display HTML cleanly (show only tag names, text, and image links)
- Extract all headers
- Remove irrelevant tags (e.g., script, header, footer)
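
The setup steps above can be sketched as a single pass over the HTML. This is a minimal sketch, not the project's code: it uses the standard library's `html.parser` instead of BeautifulSoup to stay dependency-free, it assumes "headers" means `h1` to `h6` headings, and the `OutlineParser` name and the extra `style` entry in `SKIP_TAGS` are illustrative assumptions.

```python
from html.parser import HTMLParser

# "script", "header" and "footer" come from the list above; "style" is an added assumption.
SKIP_TAGS = {"script", "style", "header", "footer"}
# Elements that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    """Build an indented outline of tag names, text and image links.

    Hypothetical helper; assumes reasonably well-nested HTML.
    """

    def __init__(self):
        super().__init__()
        self.lines = []       # outline, one entry per tag or text run
        self.headings = []    # text of every heading outside skipped tags
        self._depth = 0
        self._skip = 0        # > 0 while inside a skipped tag
        self._heading = None  # text parts of the heading currently open

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        if self._skip:
            return
        indent = "  " * self._depth
        if tag == "img":
            a = dict(attrs)
            # Lazy-loaded pages may keep the real link in data-src.
            src = a.get("data-src") or a.get("src") or ""
            self.lines.append(f"{indent}img {src}")
        else:
            self.lines.append(f"{indent}{tag}")
        if tag in HEADING_TAGS:
            self._heading = []
        if tag not in VOID_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
            return
        if self._skip or tag in VOID_TAGS:
            return
        self._depth = max(0, self._depth - 1)
        if tag in HEADING_TAGS and self._heading is not None:
            self.headings.append(" ".join(self._heading))
            self._heading = None

    def handle_data(self, data):
        text = " ".join(data.split())
        if self._skip or not text:
            return
        self.lines.append("  " * self._depth + text)
        if self._heading is not None:
            self._heading.append(text)
```

To use it, create an `OutlineParser`, call `feed(html)`, then print `"\n".join(parser.lines)` for the cleaned view and read `parser.headings` for the extracted headers.
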
HTML Cleaning
- Retain all images
- Retain all headers
- Delete elements without images or headers
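
One possible way to implement this cleaning pass is to build a small tree and prune every subtree that contains neither an image nor a header. The sketch below is an illustration under assumptions: it again uses the standard library rather than BeautifulSoup, treats `h1` to `h6` as headers, and the `Node`, `TreeBuilder`, `prune` and `clean` names are hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

KEEP_TAGS = {"img", "h1", "h2", "h3", "h4", "h5", "h6"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Parse HTML into a tree of Node objects (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs))
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            node = self._stack[-1]
            node.text = f"{node.text} {text}".strip()


def prune(node):
    """Return True if node is, or contains, an image or header; drop other subtrees."""
    if node.tag in KEEP_TAGS:
        return True
    node.children = [child for child in node.children if prune(child)]
    return bool(node.children)


def clean(html):
    """Parse html and return the pruned tree's root node."""
    builder = TreeBuilder()
    builder.feed(html)
    prune(builder.root)
    return builder.root
```

Note that text inside a deleted element is discarded along with it, so any surrounding text intended for the prompt has to be collected before pruning.
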
Identifying Product Images
- OpenAI-based Approach
  - Use OpenAI to identify the product image class:
    - First experiment: compress the tree to minimize tokens, recursively feed the API with cleaned text, and return image links as an array
    - Restructure code: from website link to BeautifulSoup processing to prompt generation and link extraction
    - Retain only crucial information (image links, header texts, class names)
    - Remove duplicates (if class + text/link is similar to the previous entry)
    - Avoid batching, keep prompts short
    - Fix issues (e.g., image source in data-src attribute on Jumia)
    - Test with other e-commerce sites
    - Reject small images (<70 pixels)
    - Use title and surrounding text in the prompt
  - Selenium-based Approach
    - Identify the largest image, its class, and adjacent images (consider CSS)
    - Integrate findings into the prompt
  - Use Crawlbase for proper website loading
  - Reject images without the same parent div
  - Compare OpenAI and non-OpenAI algorithms for precision
    - Observations:
      - Direct inspection of cleaned prompts may suffice without ChatGPT
      - Simplified prompts might directly yield images
- Extra Step: Implement image classification to:
  - Differentiate between logos and product images
  - Determine if product images belong to similar or complementary products
Final Steps
- Verify images within identified elements
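
Once a product image class has been identified, the verification step and the "same parent div" rule from the earlier list could look roughly like this. It is a sketch under assumptions: each image is tagged with its nearest enclosing `div`, and only images in the most common such `div` are kept. The `ParentTracker` and `images_in_class` names are hypothetical, and galleries that wrap every image in its own `div` would need a higher ancestor instead.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ParentTracker(HTMLParser):
    """Record (nearest enclosing div id, class, link) for every image."""

    def __init__(self):
        super().__init__()
        self.images = []   # (div_id, class, link)
        self._stack = []   # currently open (tag, id) pairs
        self._next_id = 0

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img":
            # Nearest open div, or None if the image is outside any div.
            div_id = next((i for t, i in reversed(self._stack) if t == "div"), None)
            src = a.get("data-src") or a.get("src")
            if src:
                self.images.append((div_id, a.get("class") or "", src))
        elif tag not in VOID_TAGS:
            self._stack.append((tag, self._next_id))
            self._next_id += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def images_in_class(html, product_class):
    """Return links of images with product_class that share the dominant parent div."""
    tracker = ParentTracker()
    tracker.feed(html)
    matches = [img for img in tracker.images if product_class in img[1].split()]
    if not matches:
        return []
    main_div, _ = Counter(div for div, _, _ in matches).most_common(1)[0]
    return [src for div, _, src in matches if div == main_div]
```

An empty result is a useful signal in itself: it means the identified class does not occur on the page, so the identification step should be rerun or checked.
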

Repository: aphuongle95/SmartImageExtractionEcommerce