Work in progress
A static snapshot scraper/server for SEO in single page hash-bang applications (like those built with AngularJS).
- Clone the repo
- `npm install`
- `sudo npm install -g grunt-cli`
- Make sure `phantomjs` is on your path (e.g. via `sudo npm install -g phantomjs`)
- `bower install` (to get bower, do `sudo npm install -g bower`)
- `sudo npm install -g typescript`
- `grunt && node app.js`
It keeps a static cache of rendered HTML pages for dynamic routes, so that single page applications can be indexed by search engines. Read some background here: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
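In short, a crawler that supports the AJAX crawling scheme rewrites hash-bang URLs into `_escaped_fragment_` URLs before requesting them. Illustratively (using the example domain from the config below):

```
# Pretty URL as seen by users:
#   http://my-awesome-app.com/index.html#!/item/123
# URL the crawler requests instead:
#   http://my-awesome-app.com/index.html?_escaped_fragment_=/item/123
```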
The app runs a web server with two routes. Fire a POST request to `/scrape` with `route` in the payload set to whatever fragment appears after the hashbang. Then fire a GET request to `basePath?_escaped_fragment_=fragment` and you'll be served the scraped content.
Concretely, say your `config.json` file (in the top level directory; gitignored; for defaults see `scripts/config.ts`) contains:
```json
{
  "baseUrl": "http://my-awesome-app.com/index.html",
  "basePath": "/index.html"
}
```
Fire up the app and issue the following HTTP request to scrape:

```
POST localhost:3000/scrape
```

with body `route=/item/123`. This will go to `http://my-awesome-app.com/index.html/#!/item/123`, scrape the content, and save it on disk.
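With curl, that request might look like this (assuming the server runs on port 3000, as in the URLs above):

```sh
curl -X POST --data "route=/item/123" http://localhost:3000/scrape
```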
Now to see the scraped content as a search engine would, hit the following URL:

```
localhost:3000/index.html?_escaped_fragment_=/item/123
```
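Or equivalently with curl:

```sh
curl "http://localhost:3000/index.html?_escaped_fragment_=/item/123"
```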
And you should see your content, though without CSS/JS, since we scraped only the HTML and are currently serving it from a different domain.
Wire it all up with Varnish to make sure everything is served from the same domain. Then schedule whatever content management system you have backing your application to fire POST `/scrape` requests; see the sketch below. Maybe you want to pair these requests with your sitemap generation in some nice way.
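As a rough sketch of that scheduling step (everything here is illustrative and not part of this repo: the sitemap URL, the grep/sed extraction, and the assumption that your sitemap's `<loc>` entries contain `#!` fragments), a cron-able script might look like:

```sh
#!/bin/sh
# Hypothetical: re-scrape every hash-bang route listed in a sitemap.
# Assumes <loc> entries look like http://my-awesome-app.com/index.html#!/item/123
curl -s http://my-awesome-app.com/sitemap.xml \
  | grep -o '<loc>[^<]*</loc>' \
  | sed -e 's/<loc>//' -e 's#</loc>##' \
  | sed -n 's/.*#!//p' \
  | while read route; do
      curl -X POST --data "route=$route" http://localhost:3000/scrape
    done
```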
Still a work in progress, will clean things up as I go.
TODO:
- allow the user to provide named `completenessDetection` functions, and submit a name with the scrape request POST. That way, we know when it's safe to scrape the content (e.g. Ajax calls all came back, content is stable, etc.)
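Purely as a sketch of that envisioned API (nothing below is implemented yet; the `completenessDetection` parameter and the `ajaxSettled` name are hypothetical):

```sh
# Hypothetical future request; the completenessDetection parameter does not exist yet.
curl -X POST \
  --data "route=/item/123" \
  --data "completenessDetection=ajaxSettled" \
  http://localhost:3000/scrape
```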