This codebase allows you to scrape any website and extract relevant data points easily using OpenAI Functions and LangChain. Create a schema in `schemas.py`, pick a URL, and pass them to `scrape_with_playwright()` in `main.py` to start scraping.
Tip: on most websites the bulk of the content lives in `<p>`, `<span>`, or heading (`<h1>` to `<h6>`) tags. For best performance, choose the combination of tags that works for the site you are scraping.
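For example, a combination for a news-style layout could look like the snippet below; the exact mix is purely illustrative and site-specific, so experiment with it:

```python
# One possible combination of tags for a news-style site (illustrative only).
tags = ["p", "span", "h1", "h2", "h3"]
```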
- Define the schema of the website you want to scrape in `schemas.py` (a Pydantic class or a dictionary both work; see the dictionary example after this list):

  ```python
  from pydantic import BaseModel

  class SchemaNewsWebsites(BaseModel):
      news_headline: str
      news_short_summary: str
  ```
- To start scraping, run something like this in `main.py`:

  ```python
  asyncio.run(scrape_with_playwright(
      url="https://www.bbc.com",
      tags=["span"],
      schema_pydantic=SchemaNewsWebsites,
  ))
  ```
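For the dictionary option mentioned above, a schema equivalent to the Pydantic class can be written in the JSON-schema-like shape that LangChain's extraction chains accept. How exactly `scrape_with_playwright()` expects a dict schema to be passed isn't shown here, so check the function signature in `main.py`; the dict itself would look roughly like this:

```python
# Dictionary equivalent of SchemaNewsWebsites, in the "properties"/"required"
# shape used by LangChain's extraction chains.
schema_news_websites = {
    "properties": {
        "news_headline": {"type": "string"},
        "news_short_summary": {"type": "string"},
    },
    "required": ["news_headline", "news_short_summary"],
}
```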
To set up and run the project:

- Create a virtual environment: `python -m venv virtual-env` or `python3 -m venv virtual-env` (Mac), `py -m venv virtual-env` (Windows 11)
- Activate it: `.\virtual-env\Scripts\activate` (Windows) or `source virtual-env/bin/activate` (Mac)
- Run `poetry install --sync` or `poetry install`
- Run `playwright install`
- Set the environment variable `OPENAI_API_KEY=XXXXXX` (see the note after this list)
- Run `python main.py`
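The `OPENAI_API_KEY=XXXXXX` line is written like a `.env` entry; whether the project actually reads a `.env` file or expects an exported environment variable is an assumption on my part. If you do keep the key in a `.env` file, a small check like the one below (which assumes `python-dotenv` is installed) confirms the key is visible before scraping:

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is available

# Load OPENAI_API_KEY from a local .env file, if one exists.
load_dotenv()

if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or export it")
```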
- Add a FastAPI server on top of this to serve it as an API endpoint for ease of use (see the sketch after these notes).
- Use caution when scraping. Don't do anything I wouldn't do (i.e., anything illegal).
- P.S. I've added this functionality to LangChain in this PR. You can read the official docs here. A rough sketch built from those LangChain pieces is included at the end of these notes.
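A minimal sketch of the FastAPI idea, assuming `scrape_with_playwright()` is importable from `main.py` and returns the extracted records (the route name and defaults below are made up for illustration):

```python
from fastapi import FastAPI

from main import scrape_with_playwright   # existing scraper entry point (assumed importable)
from schemas import SchemaNewsWebsites    # existing Pydantic schema

app = FastAPI()

@app.get("/scrape")
async def scrape(url: str):
    # Assumes scrape_with_playwright returns the extracted records rather
    # than printing them; adjust if the real function behaves differently.
    return await scrape_with_playwright(
        url=url,
        tags=["span"],
        schema_pydantic=SchemaNewsWebsites,
    )
```

Saved as `api.py`, this could be served with `uvicorn api:app --reload`.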
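I haven't shown the internals of `scrape_with_playwright()` here, but based on LangChain's web-scraping documentation, this is roughly the pipeline it wires together. Import paths match the LangChain version from around the time this was written and may differ in newer releases; treat it as a sketch, not the repo's actual implementation:

```python
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import AsyncChromiumLoader
from langchain.document_transformers import BeautifulSoupTransformer
from langchain.chains import create_extraction_chain_pydantic

from schemas import SchemaNewsWebsites

# Load the rendered page with a headless Chromium browser (via Playwright).
loader = AsyncChromiumLoader(["https://www.bbc.com"])
docs = loader.load()

# Keep only the text found in the chosen tags.
bs_transformer = BeautifulSoupTransformer()
docs_transformed = bs_transformer.transform_documents(docs, tags_to_extract=["span"])

# Ask a function-calling model to pull structured records matching the schema.
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
chain = create_extraction_chain_pydantic(pydantic_schema=SchemaNewsWebsites, llm=llm)
extracted = chain.run(docs_transformed[0].page_content)
print(extracted)
```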