paperweight

paperweight backs up and extracts data from the pdf links in your markdown files

script running in CLI:

[cli screenshot]

dash showing embeddings in 3d space:

[dash screenshot]

sqlite db (shown in Beekeeper Studio Ultimate):

[db screenshot]

features

  • finds links to pdfs in markdown files (see the sketch after this list)
  • backs up the pdfs in a sqlite database
  • full pdf saved in a blob
  • text extraction via https://github.com/pymupdf/PyMuPDF
  • text embeddings from openai stored as encoded JSON
  • metadata via gpt-3.5-turbo function calling
    • title
    • keywords
    • authors
    • abstract
    • published_date
    • summary
    • institution
    • location
    • doi
  • screenshot of first page saved as blob
  • 3d viz of embeddings via dash and plotly
  • cloud backup via cloudflare r2 or amazon s3
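
as a rough illustration of the first two features, link discovery plus PyMuPDF extraction might look like this. a sketch only; the function names are illustrative, not the repo's actual code:

# illustrative sketch of pdf link discovery and text extraction;
# function names are not the repo's actual code
import re
from pathlib import Path

import fitz  # PyMuPDF

PDF_LINK_RE = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+\.pdf)\)", re.IGNORECASE)

def find_pdf_links(directory: str) -> list[str]:
    # collect pdf urls from every markdown file under the directory
    links: list[str] = []
    for md_file in Path(directory).rglob("*.md"):
        links.extend(PDF_LINK_RE.findall(md_file.read_text(encoding="utf-8")))
    return links

def extract_text(pdf_bytes: bytes) -> str:
    # pull plain text out of a pdf blob with PyMuPDF
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return "\n".join(page.get_text() for page in doc)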

limitations

embeddings are currently limited to the first 8191 tokens of the pdf, the max input size of text-embedding-3-small. chunking the text and embedding it in parts to support full documents is a future feature (sketched at the end of this section).

(2/15/24 update) - perhaps huge context windows are around the corner anyway

full text is currently limited to 10mb per row. this is arbitrary and will be configurable in the future.
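
for reference, the chunked approach might look roughly like this; a sketch of the future feature, assuming tiktoken for token counting, not something paperweight does today:

# hypothetical chunking pass for full-document embeddings; paperweight
# currently embeds only the first 8191 tokens
import tiktoken
from openai import OpenAI

MAX_TOKENS = 8191  # input limit of text-embedding-3-small

def embed_full_text(text: str) -> list[list[float]]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    # split the token stream into windows that fit the model's input limit
    chunks = [
        enc.decode(tokens[i : i + MAX_TOKENS])
        for i in range(0, len(tokens), MAX_TOKENS)
    ]
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [item.embedding for item in response.data]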

Usage

run via the command line with the following command:

python main.py --directory ~/path/to/your/mds

the dash will be accessible at http://127.0.0.1:8050/. see the --remain-open arg below for keeping the dash running after processing is complete.
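
under the hood, the dash view is a plotly 3d scatter over the embeddings. a minimal sketch, assuming a PCA reduction to three dimensions; the repo's actual reduction and layout may differ:

# minimal sketch of a 3d embedding view with dash + plotly; the PCA
# reduction and column handling are assumptions, not the repo's code
import numpy as np
import plotly.express as px
from dash import Dash, dcc
from sklearn.decomposition import PCA

def serve_embeddings(embeddings: np.ndarray, titles: list[str]) -> None:
    xyz = PCA(n_components=3).fit_transform(embeddings)
    fig = px.scatter_3d(x=xyz[:, 0], y=xyz[:, 1], z=xyz[:, 2], hover_name=titles)
    app = Dash(__name__)
    app.layout = dcc.Graph(figure=fig)
    app.run()  # serves on http://127.0.0.1:8050/ by default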

CLI Args

  • --directory - path to the directory containing markdown files. defaults to the value of the DIRECTORY_NAME environment variable or the current directory if not set
  • --db-name - the name for the database. defaults to the value of the DB_NAME environment variable or papers.db if not specified
  • --model-name - specifies the OpenAI model name to be used. defaults to the value of the MODEL_NAME environment variable or gpt-3.5-turbo-0125 if not provided
  • --verbose - enables verbose mode, providing detailed logging. this defaults to the boolean value of the VERBOSE environment variable or False if not set
  • --remain-open - keeps the application running even after processing is complete, useful for continuous operation or debugging. this defaults to the boolean value of the REMAIN_OPEN environment variable or False if not specified
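
the flag, then env var, then built-in default precedence described above can be wired with plain argparse; a sketch, not the actual main.py:

# sketch of the cli arg / env var / default fallbacks described above;
# illustrative, not the actual main.py
import argparse
import os

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--directory", default=os.environ.get("DIRECTORY_NAME", "."))
    parser.add_argument("--db-name", default=os.environ.get("DB_NAME", "papers.db"))
    parser.add_argument(
        "--model-name", default=os.environ.get("MODEL_NAME", "gpt-3.5-turbo-0125")
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        default=os.environ.get("VERBOSE", "").lower() == "true",
    )
    parser.add_argument(
        "--remain-open",
        action="store_true",
        default=os.environ.get("REMAIN_OPEN", "").lower() == "true",
    )
    return parser.parse_args()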

Environment Variables

To enhance security and flexibility, certain configurations are managed through environment variables:

  • OPENAI_API_KEY - your OpenAI API key, required for generating embeddings and extracting data. this is not explicitly called for anywhere in the application code, but is rather automagically used by the openai library.
  • DIRECTORY_NAME - (optional) can be set to define a default directory for --directory, overriding the default current directory
  • DB_NAME - (optional) sets a default database name for --db-name, overriding the default papers.db
  • MODEL_NAME - (optional) determines the default model name for --model-name, if not specified via CLI, defaulting to gpt-3.5-turbo-0125
  • VERBOSE - (optional) can be set to true to enable verbose mode by default, serving as the fallback for the --verbose flag
  • REMAIN_OPEN - (optional) when set to true, the application remains open after processing by default, serving as the fallback for the --remain-open flag. useful for continuing to look at the dash app after processing is complete.

more env vars for cloud backup

the following environment variables are used for cloud backup functionality:

  • S3_BUCKET_NAME - specifies the S3 bucket name where backups are stored
  • S3_ENDPOINT_URL - the endpoint URL for S3 services
  • AWS_ACCESS_KEY_ID - your AWS access key ID
  • AWS_SECRET_ACCESS_KEY - your AWS secret access key
  • S3_REGION_NAME - defines the AWS region for the S3 service. defaults to auto if not explicitly set, allowing automatic determination based on the endpoint URL
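
wiring these variables into the upload with boto3 might look like this; sketch only, the helper name is illustrative:

# sketch of the cloud backup upload with boto3; works against both s3 and
# r2 via S3_ENDPOINT_URL. the helper name is illustrative.
import os

import boto3

def backup_db(db_path: str) -> None:
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["S3_ENDPOINT_URL"],
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
        region_name=os.environ.get("S3_REGION_NAME", "auto"),
    )
    s3.upload_file(db_path, os.environ["S3_BUCKET_NAME"], os.path.basename(db_path))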

.env example

To configure your script with environment variables, you can use a .env file. Here's an example that you can customize:

# Application Configuration
DIRECTORY_NAME=./path/to/markdowns
DB_NAME=papers.db
MODEL_NAME=gpt-3.5-turbo-0125
VERBOSE=true
REMAIN_OPEN=false

# AWS S3 Configuration
S3_BUCKET_NAME=your_bucket_name
S3_ENDPOINT_URL=https://s3.your-region.amazonaws.com
AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
S3_REGION_NAME=your_region_name

replace the placeholder values with your actual configuration details. this file should be named .env and should not be committed to version control for security reasons (people can steal your openai key). this is why .env is in the .gitignore file.
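
if you need to load the file from python yourself, python-dotenv is one way to do it; whether paperweight loads it this way is an assumption, check main.py:

# loading .env with python-dotenv; an assumption about how the repo reads
# configuration, check main.py to confirm
from dotenv import load_dotenv

load_dotenv()  # populates os.environ from the .env file in the working directory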

requirements

  • openai api key
  • python 3.11.1
    • probably works with 3.12, but has not been tested
  • for cloud backup you will need the following. see https://developers.cloudflare.com/r2/examples/aws/boto3/ for more information on how to set up your keys and what the endpoint_url should be:
    • endpoint_url
    • access_key
    • secret_key

see usage above for how to configure these as environment variables.

cloud backup

you can optionally set up a cloud backup of the sqlite database to aws s3/cloudflare r2.

see usage section for detailed instructions on how to set up the backup.

I evaluated using turso embedded replicas for this, but turso does not seem to support BLOB columns, so I did not think it would be a good fit. The primary purpose of these backups is so your PDFs do not disappear to link rot, so the blob columns are important.

if you want to turn the non-blob columns into a distributed database, perhaps to create an API to serve your papers with a JS based frontend, you might want to consider using turso. I have not tried this, but it seems like it would be a good fit for that use case.

WAL?

papers are inserted into the sqlite database from a separate thread while the dash app reads from the same file, and I did not run into any issues serving it locally.

if I were to run into issues, turning on WAL mode would probably be the first thing I would try.

if you are running into issues, or want to pull from the database more intensely, you might want to consider turning on WAL mode. The main downside seems to be the creation of two more files.
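
for reference, enabling WAL is a single pragma in standard sqlite; paperweight does not turn this on by default:

# standard sqlite pragma to enable write-ahead logging; not something
# paperweight does by default
import sqlite3

conn = sqlite3.connect("papers.db")
conn.execute("PRAGMA journal_mode=WAL")  # persists in the database file
conn.close()

the two extra files are the -wal and -shm sidecars sqlite creates next to the database.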

prompt engineering

results can be improved with better prompt engineering in the extractor in models.py.
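
for a sense of what that looks like, here is a trimmed function-calling schema covering a few of the metadata fields listed under features; the real schema in models.py may differ:

# trimmed sketch of a function-calling extractor for a few of the metadata
# fields listed under features; the real schema in models.py may differ
from openai import OpenAI

PAPER_SCHEMA = {
    "name": "extract_paper_metadata",
    "description": "Extract bibliographic metadata from the text of a paper.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "authors": {"type": "array", "items": {"type": "string"}},
            "abstract": {"type": "string"},
            "doi": {"type": "string"},
        },
        "required": ["title"],
    },
}

def extract_metadata(text: str, model: str = "gpt-3.5-turbo-0125") -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": text[:20000]}],
        tools=[{"type": "function", "function": PAPER_SCHEMA}],
        tool_choice={"type": "function", "function": {"name": "extract_paper_metadata"}},
    )
    return response.choices[0].message.tool_calls[0].function.arguments  # JSON string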

coming soon (?)

  • n-shot training to extract data better
  • turbo mode with many modal containers
  • support more link types
    • arxiv
    • wikipedia
  • support local pdf files
  • backing up full text and embeddings
  • make embeddings and NER optional
  • modularize the existing functionality to reduce core dependencies
    • dash server is a separate service
    • cloud backup is a separate service
  • test coverage
  • datasette plugin
  • gptstore plugin (?)
  • service to run in the cloud on different data sources

thank you

inspired by, and hopefully mostly compatible with, varepsilon's rsrch.space

inspired to work with files over apps by kepano
