Aspen

Aspen lets you search a large corpus of plain text files via the browser.

Powerful search query support through Elasticsearch query string syntax
Performs some basic cleanup of plaintext data and can extract document titles
Responsive UI that works on mobile
Runs in Docker

Getting Started using Docker Compose

1. Collect your documents

Put all your files in one place, like ~/ebooks/:

$ tree ~/ebooks
/Users/ian/ebooks
└── Project\ Gutenberg/
    ├── Beowulf.txt
    ├── Dracula.txt
    ├── Frankenstein.txt

2. Run Aspen & Elasticsearch

$ docker-compose up -d
Creating network "aspen_default" with the default driver
Creating elasticsearch ... done
Creating aspen         ... done

3. Convert any non-plaintext (PDFs, MS Word) documents to plaintext

Use the included convert utility, which wraps Apache Tika, to convert them to plaintext. Pass it a filename relative to your data directory:

$ ls ~/ebooks
Project Gutenberg Test.docx

$ docker-compose run aspen convert Test.docx
Starting elasticsearch ... done
Test.docx doesn't exist, trying /data/Test.docx
Creating /data/Test.txt...
...
OK

$ ls ~/ebooks
Project Gutenberg Test.docx         Test.txt
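
If you have many files to convert, a small loop around the same command can save some typing. The following is a minimal sketch, not part of Aspen: the convert_all function and the CONVERT_CMD variable are hypothetical names, it only looks at the top level of the directory you pass in, and it assumes convert takes one filename per run as shown above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Aspen): run the convert command once
# for every top-level file in a data directory that is not already a .txt file.
# CONVERT_CMD is an illustrative override, e.g. CONVERT_CMD=echo for a dry run.
convert_all() {
  data_dir=$1
  for path in "$data_dir"/*; do
    [ -f "$path" ] || continue       # skip directories and unmatched globs
    name=${path##*/}                 # filename relative to the data directory
    case $name in
      *.txt) continue ;;             # already plaintext, nothing to convert
    esac
    # Intentionally unquoted so the default command splits into words.
    ${CONVERT_CMD:-docker-compose run aspen convert} "$name"
  done
}

# Usage, run from the directory containing docker-compose.yml:
#   CONVERT_CMD=echo; convert_all ~/ebooks    # dry run, prints the filenames
#   unset CONVERT_CMD; convert_all ~/ebooks   # converts each file in turn
```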

4. Import content into Elasticsearch

Start by resetting Elasticsearch to make sure everything is working:

$ docker-compose run aspen es-reset
Starting elasticsearch ... done
Results from DELETE: { acknowledged: true }
✓ Done.

Now import all .txt documents. The import script will try to figure out the title of the document automatically:

$ docker-compose run aspen import
Starting elasticsearch ... done
→ Base directory is /app/public/data
▲ Ignoring non-text path: Test.docx
→ Test.txt → Test Document
→ Project Gutenberg/Beowulf.txt → The Project Gutenberg EBook of Beowulf
→ Project Gutenberg/Dracula.txt → The Project Gutenberg EBook of Dracula, by Bram Stoker
→ Project Gutenberg/Frankenstein.txt → Project Gutenberg's Frankenstein, by Mary Wollstonecraft (Godwin) Shelley
✓ Done!
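
To get a feel for what automatic title detection can look like, here is a rough sketch of one simple heuristic: take the first non-blank line of the file and trim it. This is an assumption for illustration only, not the logic of Aspen's import script, and guess_title is a made-up name.

```shell
#!/bin/sh
# Hypothetical title guess: print the first non-blank line of a file, trimmed.
# This is NOT taken from Aspen's import script; it only illustrates the idea.
guess_title() {
  awk '
    { sub(/\r$/, "") }                # drop a trailing carriage return (CRLF files)
    NF {                              # first line that still has visible content
      gsub(/^[ \t]+|[ \t]+$/, "")     # trim leading and trailing blanks
      print
      exit
    }
  ' "$1"
}

# Usage:
#   guess_title ~/ebooks/Test.txt
```

A heuristic like this is easily thrown off by a stray header or an odd first line, so treat it as a starting point rather than a reliable extractor.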

You can also run import with a directory or file name relative to the data directory. For example, import Project\ Gutenberg or import Project\ Gutenberg/Dracula.txt.

Sometimes plaintext documents act strangely. Maybe bin/import can't extract a title or maybe the search highlights are off. The file might have the wrong line endings or one of those annoying UTF-8 BOM headers. Try running dos2unix on your text files to fix them.
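
If dos2unix isn't available, the same cleanup can be approximated with sed. This is a minimal sketch under the assumption that the file is UTF-8 and the only problems are a leading BOM and CRLF line endings; normalize_text is a hypothetical name, and the function writes to stdout rather than editing the file in place.

```shell
#!/bin/sh
# Sketch: strip a leading UTF-8 BOM and convert CRLF line endings to LF.
# Writes the cleaned text to stdout; the input file is left untouched.
normalize_text() {
  bom=$(printf '\357\273\277')   # the three BOM bytes EF BB BF
  cr=$(printf '\r')              # a literal carriage return
  # LC_ALL=C makes sed treat the input as raw bytes.
  # First expression: remove the BOM, on line 1 only.
  # Second expression: remove a carriage return at the end of every line.
  LC_ALL=C sed -e "1s/^$bom//" -e "s/$cr\$//" "$1"
}

# Usage:
#   normalize_text Dracula.txt > Dracula.clean.txt
```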

5. Done!

Go to http://localhost:3000/ and start searching!
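
Searches use Elasticsearch query string syntax, so operators such as AND, OR and NOT, quoted phrases, and wildcards like creat* should work. If a search behaves unexpectedly, it can help to send the same query string to Elasticsearch directly. The sketch below builds a query_string request body; query_body is a hypothetical helper, the localhost:9200 address assumes your docker-compose.yml exposes Elasticsearch on its default port, and Aspen may construct its own queries differently.

```shell
#!/bin/sh
# Hypothetical helper: wrap a query string in an Elasticsearch
# query_string request body (JSON), printed on one line.
query_body() {
  # Escape backslashes and double quotes so the query is a valid JSON string.
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"query":{"query_string":{"query":"%s"}}}\n' "$escaped"
}

# Usage (assumes Elasticsearch is reachable on localhost:9200):
#   curl -s -H 'Content-Type: application/json' \
#     'http://localhost:9200/_search' \
#     -d "$(query_body 'castle AND "Count Dracula"')"
```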

Development Setup

1. Install dependencies

It's easiest to use Elasticsearch via Docker.

You can get Node and Yarn via Homebrew on macOS, or you can download Node.js v8.5 or later and run npm install -g yarn to get Yarn.

For document conversion (bin/convert) you'll also want Tika, unrtf, and par:

On macOS you can brew install node tika unrtf par.

2. Clone the repo

$ git clone git@github.com:statico/aspen.git
$ cd aspen
$ yarn install

3. Set up Elasticsearch and import your data

See steps 1-4 in the "Getting Started using Docker Compose" section above. In short, get your text files together in one place, set up Elasticsearch, and import them with the bin/import command.

4. Start the web app

Aspen is built using Next.js, which is Node + ES6 + Express + React + hot reloading + lots more. Simply run:

$ yarn run dev

...and go to http://localhost:3000

If you are working on server.js and want automatic server restarting, do:

$ yarn global add nodemon
$ nodemon -w server.js -w lib -x yarn -- run dev

Development Notes

This started as an Angular 1 + CoffeeScript example. I recently migrated it to use Next.js, ES6 and React. You can view a full diff here.
I'm still using Elasticsearch 1.7 because I haven't bothered to learn the newer versions.
