Commit

nbc headlines updates
asg017 committed Oct 2, 2024
1 parent 496560c commit f43ae7a
Showing 3 changed files with 1,133 additions and 541 deletions.
22 changes: 22 additions & 0 deletions examples/nbc-headlines/1_scrape.ipynb
@@ -1,5 +1,27 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NBC News Headlines: Scraper\n",
"\n",
"This notebook implements a scraper for [NBC News](https://www.nbcnews.com) headlines. It uses [this sitemap](https://www.nbcnews.com/archive/articles/2024/march), which lists article headlines and URLs\n",
"for every month of the past few years. \n",
"\n",
"This dataset is mainly meant as a simple, small, real-world text dataset for testing embeddings. \n",
"The headlines are short pieces of text (about a dozen words), cover a wide range of semantic meaning, and are more \"real-world\"\n",
"than some other embeddings datasets out there.\n",
"\n",
"This notebook uses [Deno](https://deno.com/), [linkedom](https://github.com/WebReflection/linkedom), and a few \n",
"SQLite extensions to scrape the headlines for a given date range. It creates a single SQL table, `articles`, \n",
"with a few columns like `headline` and `url`. By default it will get all article headlines from January 2024 -> present\n",
"and save them to a database called `headlines-2024.db`. Feel free to copy+paste this code into your own custom scraper. \n",
"\n",
"This notebook only scrapes the data into a SQLite database; it does NOT do any embeddings or vector search. \n",
"For examples of those, see [`./2_build.ipynb`](./2_build.ipynb) and [`./3_search.ipynb`](./3_search.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 43,

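The notebook description in this diff says the scraper walks a monthly sitemap for a date range (January 2024 to present by default). As a rough sketch of that first step, the snippet below builds one archive URL per month. The helper name `archivePages`, the `MonthPage` shape, and the assumption that every month follows the same `/archive/articles/<year>/<lowercase month name>` pattern as the one linked example are illustrative guesses, not code from the notebook.

```typescript
// Hypothetical helper: not taken from 1_scrape.ipynb.
// Assumes every month uses the same URL pattern as the linked example
// (https://www.nbcnews.com/archive/articles/2024/march).
const ARCHIVE_BASE = "https://www.nbcnews.com/archive/articles";

const MONTH_NAMES: string[] = [
  "january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december",
];

interface MonthPage {
  year: number;
  month: string; // lowercase month name, as used in the URL
  url: string;
}

// Build one archive page entry per month, from (startYear, startMonth)
// through (endYear, endMonth) inclusive. Months are 1-based.
function archivePages(
  startYear: number,
  startMonth: number,
  endYear: number,
  endMonth: number,
): MonthPage[] {
  const result: MonthPage[] = [];
  let year = startYear;
  let month = startMonth;
  while (year < endYear || (year === endYear && month <= endMonth)) {
    const name = MONTH_NAMES[month - 1];
    result.push({ year, month: name, url: `${ARCHIVE_BASE}/${year}/${name}` });
    month += 1;
    if (month > 12) {
      // Roll over into January of the next year.
      month = 1;
      year += 1;
    }
  }
  return result;
}

// Example: list the archive pages for January through March 2024.
for (const page of archivePages(2024, 1, 2024, 3)) {
  console.log(page.url);
}
```

Each URL would then be fetched and parsed (the notebook mentions linkedom for that) to extract `headline` and `url` values for the `articles` table; that part is left out here because the notebook's actual selectors are not shown in this diff.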