nbc headlines updates

asg017 · Oct 2, 2024 · f43ae7a
1 parent 496560c
commit f43ae7a

Showing 3 changed files with 1,133 additions and 541 deletions.
diff --git a/examples/nbc-headlines/1_scrape.ipynb b/examples/nbc-headlines/1_scrape.ipynb
@@ -1,5 +1,27 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# NBC News Headlines: Scraper\n",
+    "\n",
+    "This notebook implements a scraper for [NBC News](https://www.nbcnews.com) headlines. It uses [this sitemap](https://www.nbcnews.com/archive/articles/2024/march), which provides a list of article headlines + URLs\n",
+    "for every month for the past few years. \n",
+    "\n",
+    "This dataset is mostly meant as a simple, small, real-world text dataset for testing embeddings. \n",
+    "The headlines are short pieces of text (about a dozen words), cover a wide range of semantic meaning, and are more \"real-world\"\n",
+    "than some other embeddings datasets out there.\n",
+    "\n",
+    "This notebook uses [Deno](https://deno.com/), [linkedom](https://github.com/WebReflection/linkedom), and a few \n",
+    "SQLite extensions to scrape the headlines for a given date range. It creates a single SQL table, `articles`, \n",
+    "with a few columns like `headline` and `url`. By default it will get all article headlines from January 2024 -> present\n",
+    "and save them to a database called `headlines-2024.db`. Feel free to copy+paste this code into your own custom scraper. \n",
+    "\n",
+    "Note that this notebook only scrapes the data into a SQLite database; it does NOT do any embeddings + vector search. \n",
+    "For examples of those, see [`./2_build.ipynb`](./2_build.ipynb) and [`./3_search.ipynb`](./3_search.ipynb)."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 43,