Getting started
Cascalog is hosted at the Clojars Maven repository.
- Make sure you have Java 1.6 installed.
- Set the maximum JVM heap size to 768 MB: `export JAVA_OPTS=-Xmx768m`
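Because Cascalog is published on Clojars, a Leiningen project can pull it in as an ordinary dependency. The `project.clj` below is only a sketch: the project name `cascalog-sandbox` is hypothetical, and every version string (Clojure, Cascalog, `hadoop-core`) is an assumption, so check Clojars for the release you actually want. It also assumes that Hadoop has to be supplied separately for local development, and it repeats the heap setting through Leiningen's `:jvm-opts` key.

```clojure
;; project.clj -- illustrative only; the name and all versions are assumptions.
(defproject cascalog-sandbox "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [cascalog "1.10.0"]]
  ;; Assumed: Hadoop is provided as a dev-only dependency for local runs.
  :profiles {:dev {:dependencies [[org.apache.hadoop/hadoop-core "1.0.3"]]}}
  ;; Same heap limit as the JAVA_OPTS setting above.
  :jvm-opts ["-Xmx768m"])
```

With a file like this in place, `lein repl` should start a REPL that has Cascalog on the classpath.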
Read the tutorials and follow along in your REPL, experimenting with Cascalog's playground dataset as you go. Start with Introductory Tutorial, Part 1, then continue with Introductory Tutorial, Part 2.
Nathan's tech talk at LinkedIn goes through an in-depth example of using Cascalog to perform a complex query on real-world data. Watching this talk in full is highly recommended.
The Cascalog API is defined in the api.clj and ops.clj source files. Both files are short, so it is worth reading through them to familiarize yourself with the API.
After you've gone through the tutorials, read through the documentation on this wiki.
Cascalog can be run from the REPL on your local machine. In this case Hadoop runs in "local mode", which means it runs entirely in process. This is useful for experimentation and for local analysis of small datasets.
Cascalog comes with some "playground" datasets that are useful for learning the tool. These datasets are used in the introductory tutorials, and you can see them in the playground.clj file in the Cascalog source.
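As a rough illustration of a local-mode session against the playground data, the sketch below loads the API and playground namespaces and runs one query. It assumes Cascalog and a Hadoop dependency are already on the REPL classpath, and that the playground exposes a two-field `age` dataset (person, age), as the introductory tutorials use; check playground.clj for the datasets shipped with your version.

```clojure
;; Assumes Cascalog and Hadoop are on the classpath of this REPL.
(use 'cascalog.api)         ; query operators such as ?<- and the stdout tap
(use 'cascalog.playground)  ; in-memory sample datasets such as `age`

;; Print each person in the `age` dataset who is younger than 30,
;; together with their age. ?<- both defines the query and executes it;
;; (stdout) sends the resulting tuples to the console.
(?<- (stdout)
     [?person ?age]
     (age ?person ?age)
     (< ?age 30))
```

Since everything runs in process, the query executes without a cluster, so it is easy to change the predicate and rerun it while experimenting.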
See this tutorial for information about developing and running a Cascalog query on a cluster.