Research on an AI-based solution for automatically extracting product images from any e-commerce site

aphuongle95/SmartImageExtractionEcommerce

Extracting Images from Ecommerce Sites

This project explores how prompting LLMs such as ChatGPT can improve the precision of product image extraction.

Note

This is an ongoing research project, so the code may not be very clean.

Objective

The goal of this project is to extract product images from e-commerce product pages while excluding irrelevant images, such as logos or images of similar products. This requires handling pages in various languages and filtering images based on text content.

Planned Steps

  1. Initial Setup

    • Display HTML cleanly (show only tag names, text, and image links)
    • Extract all headers
    • Remove irrelevant tags (e.g., script, header, footer)
  2. HTML Cleaning

    • Retain all images
    • Retain all headers
    • Delete elements without images or headers
  3. Identifying Product Images

    • OpenAI-based Approach
      • Use OpenAI to identify the product image class:
        • First experiment: compress the tree to minimize tokens, recursively feed the API with cleaned text, and return image links as an array
        • Restructure code: from website link to BeautifulSoup processing to prompt generation and link extraction
        • Retain only crucial information (image links, header texts, class names)
        • Remove duplicates (if class + text/link is similar to the previous entry)
        • Avoid batching, keep prompts short
        • Fix issues (e.g., image source in data-src attribute on Jumia)
        • Test with other e-commerce sites
        • Reject small images (<70 pixels)
        • Use title and surrounding text in the prompt
      • Selenium-based Approach
        • Identify the largest image, its class, and adjacent images (consider CSS)
        • Integrate findings into the prompt
      • Use Crawlbase for proper website loading
      • Reject images that do not share the same parent div
      • Compare OpenAI and non-OpenAI algorithms for precision
        • Observations:
          • Direct inspection of cleaned prompts may suffice without ChatGPT
          • Simplified prompts might directly yield images
    • Extra Step: Implement image classification to:
      • Differentiate between logos and product images
      • Determine if product images belong to similar or complementary products
  4. Final Steps

    • Verify images within identified elements
