Commit
doc: Update README.md
Adding a few more details and a link to article.
philippe2803 authored Apr 29, 2024
1 parent b5dd0f3 commit 62ac810
Showing 1 changed file with 15 additions and 5 deletions.
20 changes: 15 additions & 5 deletions README.md
@@ -4,6 +4,10 @@ A way to share content from a specific domain using SQLite as an alternative to
RSS feeds. The purpose of this library is to simply create a dataset for all the
content on your website, using the XML sitemap as a starting point.

+Possibility to include vector search similarity features in the dataset very easily.
+
+Article that explains the rationale behind this type of datasets [here](https://philippeoger.com/pages/can-we-rag-the-whole-web/).
+

## Installation

@@ -15,15 +19,21 @@ pip install contentmap

## Quickstart

-To build your contentmap.db that will contain all your content using your XML
-sitemap as a starting point, you only need to write the following:
+To build your contentmap.db with vector search capabilities and containing all
+your content using your XML sitemap as a starting point, you only need to write the
+following:

```python
from contentmap.sitemap import SitemapToContentDatabase

-database = SitemapToContentDatabase("https://yourblog.com/sitemap.xml")
-database.load()
+database = SitemapToContentDatabase(
+    sitemap_url="https://yourblog.com/sitemap.xml",
+    concurrency=10,
+    include_vss=True
+)
+database.build()

```

-You can control how many urls can be crawled concurrently and also set some timeout.
+This will automatically create the SQLite database file, with vector search
+capabilities (piggybacking on sqlite-vss integration on Langchain).
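
Since the result of the build is an ordinary SQLite file, one way to check what was produced is to list its tables with Python's standard `sqlite3` module. The sketch below is not part of contentmap: `list_tables` is a hypothetical helper, and the README does not document the table names or schema, so the code only reads SQLite's built-in `sqlite_master` catalog and assumes nothing about what the library stores.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in a SQLite file, sorted by name.

    Hypothetical helper for inspection only; it is not part of contentmap.
    """
    # Open the file read-only through a URI so that a wrong path raises an
    # error instead of silently creating a new, empty database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # sqlite_master is SQLite's own catalog of schema objects.
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Assuming the generated file is named `contentmap.db` as in the Quickstart text, calling `list_tables("contentmap.db")` after `database.build()` would show which tables exist before you write any queries against them.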
