Aspen lets you search a large corpus of plain text files via the browser.
- Powerful search query support through Elasticsearch query string syntax
- Performs some basic cleanup of plaintext data and can extract document titles
- Responsive UI that works on mobile
- Runs in Docker
Put all your files in one place, like ~/ebooks/:
$ tree ~/ebooks
/Users/ian/ebooks
└── Project\ Gutenberg/
    ├── Beowulf.txt
    ├── Dracula.txt
    └── Frankenstein.txt
$ docker-compose up -d
Creating network "aspen_default" with the default driver
Creating elasticsearch ... done
Creating aspen ... done
If some of your documents are not plaintext (a .docx file, for example), use the included convert utility, which wraps Apache Tika, to convert them to plaintext. Pass it a filename relative to your data directory:
$ ls ~/ebooks
Project Gutenberg Test.docx
$ docker-compose run aspen convert Test.docx
Starting elasticsearch ... done
Test.docx doesn't exist, trying /data/Test.docx
Creating /data/Test.txt...
...
OK
$ ls ~/ebooks
Project Gutenberg Test.docx Test.txt
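If you have many files to convert, a small loop can call convert once per file. The following is only a sketch and is not part of Aspen: the function name convert_all_docx and the ASPEN_RUN variable are made up for illustration, and it only looks at .docx files directly inside the data directory.

```sh
# Hypothetical helper, not shipped with Aspen.
# Usage: convert_all_docx ~/ebooks
# Set ASPEN_RUN=echo to preview the commands without starting Docker.
convert_all_docx() {
  data_dir="$1"
  for path in "$data_dir"/*.docx; do
    # If the glob matched nothing, the literal pattern is left over; skip it.
    [ -e "$path" ] || continue
    # convert expects a name relative to the data directory, so pass the basename.
    ${ASPEN_RUN:-docker-compose run aspen} convert "$(basename "$path")"
  done
}
```

Running it with ASPEN_RUN=echo first is a cheap way to check which files would be picked up.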
Start by resetting Elasticsearch to make sure everything is working:
$ docker-compose run aspen es-reset
Starting elasticsearch ... done
Results from DELETE: { acknowledged: true }
✓ Done.
Now import all .txt documents. The import script will try to figure out the title of the document automatically:
$ docker-compose run aspen import
Starting elasticsearch ... done
✓ Base directory is /app/public/data
▲ Ignoring non-text path: Test.docx
✓ Test.txt → Test Document
✓ Project Gutenberg/Beowulf.txt → The Project Gutenberg EBook of Beowulf
✓ Project Gutenberg/Dracula.txt → The Project Gutenberg EBook of Dracula, by Bram Stoker
✓ Project Gutenberg/Frankenstein.txt → Project Gutenberg's Frankenstein, by Mary Wollstonecraft (Godwin) Shelley
✓ Done!
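To preview what a title guess might look like before importing, you can try a rough stand-in on a single file. This sketch is an assumption, not Aspen's actual logic: it simply takes the first non-blank line, which happens to match the titles in the output above. The name guess_title is hypothetical.

```sh
# Hypothetical helper, not shipped with Aspen and not its real title logic.
# Usage: guess_title ~/ebooks/Test.txt
guess_title() {
  # Delete carriage returns, then print the first line that contains
  # non-whitespace, with leading and trailing spaces and tabs trimmed.
  tr -d '\r' < "$1" |
    awk 'NF { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, ""); print; exit }'
}
```

If the guess looks wrong here, the real import is likely to struggle with that file too.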
You can also run import with a directory or file name relative to the data directory. For example, import Project\ Gutenberg or import Project\ Gutenberg/Dracula.txt.
Sometimes plaintext documents act strangely. Maybe bin/import can't extract a title or maybe the search highlights are off. The file might have the wrong line endings or one of those annoying UTF-8 BOM headers. Try running dos2unix on your text files to fix them.
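If dos2unix isn't installed, standard tools can do roughly the same cleanup. This is a minimal sketch, not part of Aspen, and the function name clean_text_file is made up. It rewrites the file in place and removes every carriage return, so keep a backup if a file might contain carriage returns on purpose.

```sh
# Hypothetical helper, not shipped with Aspen.
# Usage: clean_text_file ~/ebooks/Test.txt
clean_text_file() {
  src="$1"
  tmp="$(mktemp)" || return 1
  bom="$(printf '\357\273\277')"   # the three UTF-8 BOM bytes, in octal
  # LC_ALL=C makes sed and tr work on raw bytes rather than characters.
  # sed drops a BOM at the very start of the file; tr deletes every carriage return.
  LC_ALL=C sed "1s/^${bom}//" "$src" | LC_ALL=C tr -d '\r' > "$tmp" || {
    rm -f "$tmp"
    return 1
  }
  # Write back through redirection so the original file keeps its permissions.
  cat "$tmp" > "$src"
  rm -f "$tmp"
}
```

After cleaning, run the import again for that file to pick up the changes.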
Go to http://localhost:3000/ and start searching!
It's easiest to use Elasticsearch via Docker.
You can get Node and Yarn via Homebrew on Mac, or you can download Node.js v8.5 or later and run npm install -g yarn to get Yarn.
For document conversion (bin/convert) you'll want:
- Apache Tika
- UnRTF
- par
On macOS you can brew install node tika unrtf par.
$ git clone git@github.com:statico/aspen.git
$ cd aspen
$ yarn install
See steps 1-4 in the above "Using Docker" section. In short, get your text files together in one place, set up Elasticsearch, and import them with the bin/import command.
Aspen is built using Next.js, which is Node + ES6 + Express + React + hot reloading + lots more. Simply run:
$ yarn run dev
...and go to http://localhost:3000
If you are working on server.js and want automatic server restarting, do:
$ yarn global add nodemon
$ nodemon -w server.js -w lib -x yarn -- run dev
- This started as an Angular 1 + CoffeeScript example. I recently migrated it to use Next.js, ES6 and React. You can view a full diff here.
- I'm still using Elasticsearch 1.7 because I haven't bothered to learn the newer versions.