This codebase allows you to scrape any website and extract relevant data points easily using OpenAI Functions and LangChain. Create a schema in `schemas.py`, pick a URL, and pass them to `scrape_with_playwright()` in `main.py` to start scraping.
Tip: on most websites the bulk of the content lives in `<p>`, `<span>`, or heading (`<h1>` to `<h6>`) tags. For best performance, choose the combination of tags that works for the site you are scraping.
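For example, a combination for a news-style layout could look like the snippet below; the exact mix is purely illustrative and site-specific, so experiment with it:

```python
# One possible combination of tags for a news-style site (illustrative only).
tags = ["p", "span", "h1", "h2", "h3"]
```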
- Define the schema of the website you want to scrape in `schemas.py` (a Pydantic class or a dictionary both work; see the dictionary example after this list):

  ```python
  from pydantic import BaseModel

  class SchemaNewsWebsites(BaseModel):
      news_headline: str
      news_short_summary: str
  ```
- To start scraping, run something like this in `main.py`:

  ```python
  asyncio.run(scrape_with_playwright(
      url="https://www.bbc.com",
      tags=["span"],
      schema_pydantic=SchemaNewsWebsites,
  ))
  ```
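For the dictionary option mentioned above, a schema equivalent to the Pydantic class can be written in the JSON-schema-like shape that LangChain's extraction chains accept. How exactly `scrape_with_playwright()` expects a dict schema to be passed isn't shown here, so check the function signature in `main.py`; the dict itself would look roughly like this:

```python
# Dictionary equivalent of SchemaNewsWebsites, in the "properties"/"required"
# shape used by LangChain's extraction chains.
schema_news_websites = {
    "properties": {
        "news_headline": {"type": "string"},
        "news_short_summary": {"type": "string"},
    },
    "required": ["news_headline", "news_short_summary"],
}
```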
To set up and run the project:

- Create a virtual environment: `python -m venv virtual-env` or `python3 -m venv virtual-env` (Mac), `py -m venv virtual-env` (Windows 11)
- Activate it: `.\virtual-env\Scripts\activate` (Windows) or `source virtual-env/bin/activate` (Mac)
- Run `poetry install --sync` or `poetry install`
- Run `playwright install`
- Set the environment variable `OPENAI_API_KEY=XXXXXX` (see the note after this list)
- Run `python main.py`
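The `OPENAI_API_KEY=XXXXXX` line is written like a `.env` entry; whether the project actually reads a `.env` file or expects an exported environment variable is an assumption on my part. If you do keep the key in a `.env` file, a small check like the one below (which assumes `python-dotenv` is installed) confirms the key is visible before scraping:

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is available

# Load OPENAI_API_KEY from a local .env file, if one exists.
load_dotenv()

if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or export it")
```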
- Add a FastAPI server on top of this to serve it as an API endpoint for ease of use (see the sketch after these notes).
- Use caution when scraping. Don't do anything I wouldn't do (i.e., anything illegal).
- P.S. I've added this functionality to LangChain in this PR. You can read the official docs here. A rough sketch built from those LangChain pieces is included at the end of these notes.
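A minimal sketch of the FastAPI idea, assuming `scrape_with_playwright()` is importable from `main.py` and returns the extracted records (the route name and defaults below are made up for illustration):

```python
from fastapi import FastAPI

from main import scrape_with_playwright   # existing scraper entry point (assumed importable)
from schemas import SchemaNewsWebsites    # existing Pydantic schema

app = FastAPI()

@app.get("/scrape")
async def scrape(url: str):
    # Assumes scrape_with_playwright returns the extracted records rather
    # than printing them; adjust if the real function behaves differently.
    return await scrape_with_playwright(
        url=url,
        tags=["span"],
        schema_pydantic=SchemaNewsWebsites,
    )
```

Saved as `api.py`, this could be served with `uvicorn api:app --reload`.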
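I haven't shown the internals of `scrape_with_playwright()` here, but based on LangChain's web-scraping documentation, this is roughly the pipeline it wires together. Import paths match the LangChain version from around the time this was written and may differ in newer releases; treat it as a sketch, not the repo's actual implementation:

```python
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import AsyncChromiumLoader
from langchain.document_transformers import BeautifulSoupTransformer
from langchain.chains import create_extraction_chain_pydantic

from schemas import SchemaNewsWebsites

# Load the rendered page with a headless Chromium browser (via Playwright).
loader = AsyncChromiumLoader(["https://www.bbc.com"])
docs = loader.load()

# Keep only the text found in the chosen tags.
bs_transformer = BeautifulSoupTransformer()
docs_transformed = bs_transformer.transform_documents(docs, tags_to_extract=["span"])

# Ask a function-calling model to pull structured records matching the schema.
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
chain = create_extraction_chain_pydantic(pydantic_schema=SchemaNewsWebsites, llm=llm)
extracted = chain.run(docs_transformed[0].page_content)
print(extracted)
```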