Home
SpookyStuff is a fast and simple query engine for web scraping, data enrichment, and acceptance QA. It aims to let unstructured web resources be queried and linked like tables in a relational database.
SpookyStuff is the fastest and most scalable engine of its kind, with a speed record of 330,404 dynamic pages queried per hour on 300 cores.
SpookyStuff is tightly integrated with the Spark ecosystem and can export structured data directly as RDDs and Spark SQL tables.
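To make the "query and link web resources like a relational database" idea concrete, here is a minimal conceptual sketch. It does not use SpookyStuff's API, Spark, or Scala: it relies only on the Python standard library, and the `PAGES` dictionary, the `page`/`link` table names, and the helper functions are all hypothetical stand-ins chosen for illustration. Pages become rows, hyperlinks become a join table, and following links is expressed as a SQL join.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical in-memory "web" (URL -> HTML); a real engine would fetch pages.
PAGES = {
    "http://example.com/list": (
        "<html><title>Catalog</title><body>"
        '<a href="http://example.com/a">A</a>'
        '<a href="http://example.com/b">B</a>'
        "</body></html>"
    ),
    "http://example.com/a": "<html><title>Item A</title></html>",
    "http://example.com/b": "<html><title>Item B</title></html>",
}


class TitleAndLinks(HTMLParser):
    """Collects the <title> text and every <a href> of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_db(pages):
    """Turns unstructured pages into two relational tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE page (url TEXT PRIMARY KEY, title TEXT)")
    db.execute("CREATE TABLE link (src TEXT, dst TEXT)")
    for url, html in pages.items():
        parser = TitleAndLinks()
        parser.feed(html)
        db.execute("INSERT INTO page VALUES (?, ?)", (url, parser.title))
        db.executemany(
            "INSERT INTO link VALUES (?, ?)",
            [(url, dst) for dst in parser.links],
        )
    return db


def linked_titles(db, src_url):
    """Follows links from src_url and returns the titles of the targets."""
    rows = db.execute(
        "SELECT p.title FROM link l JOIN page p ON p.url = l.dst "
        "WHERE l.src = ? ORDER BY p.title",
        (src_url,),
    )
    return [row[0] for row in rows]


db = build_db(PAGES)
print(linked_titles(db, "http://example.com/list"))  # ['Item A', 'Item B']
```

The sketch runs on a single machine against canned HTML; distributing the fetch and extraction steps across a cluster is what the Spark integration described above is for.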
- Apache Spark
- Selenium
- GhostDriver/PhantomJS
- JSoup
- Apache Tika
- (built by) Apache Maven
- Scala/ScalaTest plugins
- (deployed by) Ansible
- The current implementation is influenced by Spark SQL and Mahout Sparkbinding.
Click me for a quick impression.
This environment is deployed on a Spark cluster with 8+ cores. It may be inaccessible during system upgrades or maintenance. Please contact a committer or project manager for a customized demo.
Copyright © 2014 by Peng Cheng @tribbloid, Sandeep Singh @techaddict, Terry Lin @ithinkicancode, Long Yao @l2yao and contributors.
Published under the ASF License; see LICENSE.