csvget (also, jsonget)

Uses parselets and rwget to generate csv files from websites.

What’s new in 0.3.0

  1. Added a command-line option (see the sketch after this list):

    --filter=RUBY_CODE    RUBY_CODE is eval'd once per row, with @row bound to a FasterCSV::Row.

    # Example:
    ./bin/csvget --require=chronic --require=time --filter "@row['time']=Chronic.parse(@row['time']).iso8601" # ...

  2. Added a CHANGELOG.
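
A rough sketch of what a --filter body sees, assuming the chronic gem is installed and the row has a 'time' column (the row values below are made up for illustration):

    require 'rubygems'
    require 'fastercsv'
    require 'chronic'
    require 'time'

    # @row is what the --filter code operates on: a FasterCSV::Row keyed by column name.
    @row = FasterCSV::Row.new(%w[title time], ['Example headline', '2 hours ago'])

    # The filter body from the example above, run by hand:
    @row['time'] = Chronic.parse(@row['time']).iso8601
    puts @row['time']   # e.g. "2009-08-29T15:00:00-07:00"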

Dependencies

csvget builds on parselets and rwget (see the description above).

Running on EC2

> git clone git://github.com/fizx/csvget-ec2-recipe.git
> cd csvget-ec2-recipe
> ./boot.rb
> ssh YOUR_INSTANCE

Local Installation

  1. Install the dependencies.

  2. > gem sources -a http://gems.github.com

  3. > sudo gem install fizx-csvget

Example Usage

> cat hn.let
{
  "headlines":[{
    "title": ".title a",
    "link": ".title a @href",
    "comments": "match(.subtext a:nth-child(3), '\\d+')",
    "user": ".subtext a:nth-child(2)",
    "score": "match(.subtext span, '\\d+')",
    "time": "match(.subtext, '\\d+\\s+\\w+\\s+ago')"
  }]
}
> csvget --directory-prefix=./data  -A "/x" -w 5 --parselet=hn.let http://news.ycombinator.com/
> head data/headlines.csv
comments,title,time,link,score,user
4,Simpson's paradox: why mistrust seemingly simple statistics,2 hours ago,http://en.wikipedia.org/wiki/Simpson%27s_paradox,41,waldrews
67,America's unjust sex laws,2 hours ago,http://www.economist.com/opinion/displaystory.cfm?story_id=14165460,59,MikeCapone
23,Buy somebody lunch,3 hours ago,http://www.whattofix.com/blog/archives/2009/08/buy-somebody-lu.php,58,DanielBMarkham
10,A design pattern is an artifact of a missing feature in your chosen language,3 hours ago,http://www.snell-pym.org.uk/archives/2008/12/29/design-patterns/,31,bensummers
4,API changes in Snow Leopard,1 hour ago,http://developer.apple.com/mac/library/releasenotes/MacOSX/WhatsNewInOSX/Articles/MacOSX10_6.html#//apple_ref/doc/uid/TP40008898-SW1,14,pieter
16,How to run a linux based home web server,3 hours ago,http://stevehanov.ca/blog/index.php?id=73,28,RiderOfGiraffes
1,"OpenCL ""Hello World""",1 hour ago,"http://developer.apple.com/mac/library/documentation/Performance/Conceptual/OpenCL_MacProgGuide/Example:Hello,World/Example:Hello,World.html",8,pieter
15,US Senate bill allows White House to disconnect private computers from Internet,4 hours ago,http://news.cnet.com/8301-13578_3-10320096-38.html,35,drewr
1,Strategy: Solve Only 80 Percent of the Problem,47 minutes ago,http://highscalability.com/strategy-solve-only-80-percent-problem,6,alrex021
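
The generated file is ordinary CSV with a header row, so it can be consumed from Ruby directly; a minimal sketch with FasterCSV, using the column names shown in the output above:

    require 'rubygems'
    require 'fastercsv'

    # Iterate over data/headlines.csv, using the header row for column names.
    FasterCSV.foreach('data/headlines.csv', :headers => true) do |row|
      puts "#{row['score']} points: #{row['title']} (#{row['link']})"
    end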
> csvget -h
Usage: ./bin/csvget [options] SEED_URL [SEED_URL2 ...]
        --parselet=JSON_FILE         JSON_FILE is a parselet.
    -w, --wait=SECONDS               wait SECONDS between retrievals.
    -P, --directory-prefix=PREFIX    save files to PREFIX/...
    -U, --user-agent=AGENT           identify as AGENT instead of RWget/VERSION.
    -A, --accept-pattern=RUBY_REGEX  URLs must match RUBY_REGEX to be saved to the queue.
        --time-limit=AMOUNT          Crawler will stop after this AMOUNT of time has passed.
    -R, --reject-pattern=RUBY_REGEX  URLs must NOT match RUBY_REGEX to be saved to the queue.
        --require=RUBY_SCRIPT        Will execute 'require RUBY_SCRIPT'
        --limit-rate=RATE            limit download rate to RATE.
        --http-proxy=URL             Proxies via URL
        --proxy-user=USER            Sets proxy user to USER
        --proxy-password=PASSWORD    Sets proxy password to PASSWORD
        --fetch-class=RUBY_CLASS     Must implement fetch(uri, user_agent_string) #=> [final_redirected_url, file_object]
        --store-class=RUBY_CLASS     Must implement put(key_string, temp_file)
        --dupes-class=RUBY_CLASS     Must implement dupe?(uri)
        --queue-class=RUBY_CLASS     Must implement put(key_string, depth_int) and get() #=> [key_string, depth_int]
        --links-class=RUBY_CLASS     Must implement urls(base_uri, temp_file) #=> [uri, ...]
    -S, --sitemap=URL                URL of a sitemap to crawl (will ignore inter-page links)
    -V, --version
    -Q, --quota=NUMBER               set retrieval quota to NUMBER.
        --max-redirect=NUM           maximum redirections allowed per page.
    -H, --span-hosts                 go to foreign hosts when recursive
        --connect-timeout=SECS       set the connect timeout to SECS.
    -T, --timeout=SECS               set all timeout values to SECONDS.
    -l, --level=NUMBER               maximum recursion depth (inf or 0 for infinite).
        --[no-]timestampize          Prepend the timestamp of when the crawl started to the directory structure.
        --incremental-from=PREVIOUS  Build upon the indexing already saved in PREVIOUS.
        --protocol-directories       use protocol name in directories.
        --no-host-directories        don't create host directories.
    -v, --[no-]verbose               Run verbosely
    -h, --help                       Show this message
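
The --fetch-class, --store-class, --dupes-class, --queue-class, and --links-class options take the name of a Ruby class (loaded via --require) that implements the listed methods. A hypothetical store class might look like the sketch below; the class name, constructor, and on-disk layout are assumptions for illustration, not part of csvget/rwget:

    require 'fileutils'

    # Hypothetical example: --require=my_store --store-class=MyStore
    # The crawler only asks that the store respond to put(key_string, temp_file).
    class MyStore
      def initialize
        @root = './archive'
      end

      # key_string: path-like key for the fetched page
      # temp_file:  a File/Tempfile holding the response body
      def put(key_string, temp_file)
        path = File.join(@root, key_string)
        FileUtils.mkdir_p(File.dirname(path))
        FileUtils.cp(temp_file.path, path)
      end
    end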

Copyright © 2009 Kyle Maxwell. See LICENSE for details (MIT).
