Default template for general web scraping project.

sonlia/imscrape-template

This branch is up to date with oiwn/imscrape-template:master.

Latest commit: 930732b by istinspring, May 25, 2015

What is it?

Default template for my scraping projects. Usually it's a few scrapers plus a REST API on top, for a JavaScript/mobile frontend or another service to consume.

Depends on

  • MongoDB
  • Redis

Featured by

How to use

Clone the repo

git clone https://github.com/istinspring/imscrape-template
cd imscrape-template

Install the project dependencies

pip install -r requirements.txt

Run the test scraper

python cli.py -T github

Run the REST API

python api.py

Complex run with celery

Run the API and Celery together using the Procfile

honcho start

Run the crawler as a Celery task by passing the -c option to the cli.py (command line interface) script

python cli.py -T github -c

Then open http://localhost:8000. Both JSON and XML are supported; the response format is chosen through the request's Accept header.

<resource href="github_favorites" title="github_favorites">
    <link rel="last" href="github_favorites?page=3" title="last page"/>
    <link rel="next" href="github_favorites?page=2" title="next page"/>
    <link rel="parent" href="/" title="home"/>
    <_meta>
        <max_results>10</max_results>
        <page>1</page>
        <total>25</total>
    </_meta>

    <resource href="github_favorites/555179e74dc7822d62abc2b5" title="Github_favorite">
        <_created>Tue, 12 May 2015 03:56:23 GMT</_created>
        <_etag>83dc8829ef61dfeb52111e04cb04a6e534b36810</_etag>
        <_id>555179e74dc7822d62abc2b5</_id>
        <_updated>Tue, 12 May 2015 03:56:23 GMT</_updated>
        <author>https://github.com/square</author>
        <commits>29</commits>
        <forks>120</forks>
        <repo_name>leakcanary</repo_name>
        <source_url>https://github.com/square/leakcanary</source_url>
        <stars>2116</stars>
        <watchers>155</watchers>
    </resource>

    ...
</resource>
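
For a consumer of this API, a response like the one above can be requested and read as in the minimal sketch below. It uses only the Python standard library. The helper names `build_request` and `parse_page` are hypothetical and not part of this template. It assumes the format is negotiated through the `Accept` request header, which is how Eve-based APIs usually behave; if this template relies on a different header, adjust `build_request`.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Base URL taken from the README; change it if the API runs elsewhere.
BASE_URL = "http://localhost:8000"


def build_request(resource, page=1, fmt="json"):
    """Build a GET request for one page of a resource (hypothetical helper)."""
    accept = {"json": "application/json", "xml": "application/xml"}[fmt]
    url = f"{BASE_URL}/{resource}?page={page}"
    return urllib.request.Request(url, headers={"Accept": accept})


def parse_page(xml_text):
    """Extract pagination info and items from an XML page (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    meta = root.find("_meta")
    total = int(meta.findtext("total"))
    per_page = int(meta.findtext("max_results"))
    # Each nested <resource> is one stored item; flatten its children to a dict.
    items = [
        {child.tag: child.text for child in item}
        for item in root.findall("resource")
    ]
    return {
        "page": int(meta.findtext("page")),
        "pages": -(-total // per_page),  # ceiling division
        "items": items,
    }
```

With the API running, `urllib.request.urlopen(build_request("github_favorites", fmt="xml")).read()` should return a document that `parse_page` can read. All values come back as strings, so numeric fields such as `stars` need an explicit `int()`.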

TODO

  • database post/update via internal Eve (+ validate data)
  • add honcho
  • init settings.py constants from the environment variables
  • add celery
  • add makefile
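
The TODO item about initializing settings.py constants from environment variables could look roughly like the sketch below. This is one possible approach, not the template's actual settings.py. The `env` helper, the `REDIS_URL` name, and all default values are assumptions; `MONGO_HOST`, `MONGO_PORT` and `MONGO_DBNAME` follow Eve's usual setting names.

```python
import os


def env(name, default, cast=str):
    """Return the environment variable `name` cast to a type, or `default` if unset."""
    raw = os.environ.get(name)
    return default if raw is None else cast(raw)


# MongoDB connection (names follow Eve's conventions; defaults are assumptions).
MONGO_HOST = env("MONGO_HOST", "localhost")
MONGO_PORT = env("MONGO_PORT", 27017, int)
MONGO_DBNAME = env("MONGO_DBNAME", "imscrape")  # hypothetical database name

# Redis connection (hypothetical setting name and default).
REDIS_URL = env("REDIS_URL", "redis://localhost:6379/0")
```

Keeping the defaults in the file means a local run works without any setup, while a deployment can override each value through its environment.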
