Releases: paulhoule/infovore
Transition to Jena 2.12
Transition to Hadoop 2.0
The weekly job now runs on Hadoop 2 on EMR. The 3.0 series is experimental; 4.0 will happen either when things settle down or when there is another breaking change of dependencies.
Fall 2014 BaseKB Prototype
This release contains a number of changes to the :BaseKB output, most importantly:
- $-escapes are now converted to Unicode in keys and almost all raw strings
- there is no longer a sieve3 horizontal subdivision
- output triples are grouped and sorted by subject and divided into 210 shards
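As a rough illustration of the first and third points, the sketch below unescapes keys and groups triples into shards. It assumes the Freebase key convention of `$` followed by four hex digits, and it assumes a stable hash of the subject modulo 210 as the partitioner. The function names and the CRC32 hash are hypothetical and are not taken from the Infovore code.

```python
import re
import zlib
from collections import defaultdict

SHARD_COUNT = 210  # shard count stated in the release notes

# Assumption: an escape is "$" followed by exactly four hex digits.
_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")


def unescape_key(key):
    """Replace each $XXXX escape with the character at that code point."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)


def shard_of(subject):
    """Pick a shard for a subject (hypothetical partitioner: CRC32 mod 210)."""
    return zlib.crc32(subject.encode("utf-8")) % SHARD_COUNT


def group_into_shards(triples):
    """Group (subject, predicate, object) tuples by shard, sorted by subject."""
    shards = defaultdict(list)
    for triple in triples:
        shards[shard_of(triple[0])].append(triple)
    for shard in shards.values():
        # Tuples compare subject first, so equal subjects end up adjacent.
        shard.sort()
    return dict(shards)
```

Because the shard is derived from the subject alone, every triple about a given subject lands in the same shard, which is what makes per-subject processing of a single shard possible.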
Numerous changes have happened behind the scenes, the most important of which is that the Spring XML that defines the weekly job has been moved into the bakemono project and is exported in a small JAR file that haruhi reads.
This release has cleared away obstacles to some big changes in dependencies which will happen soon.
Centipede Bump and Miscellaneous Apps
This version of Infovore is linked against Centipede 99.6 and includes a version bump to Spring 4.0.5.
In other news, several half-baked utilities have been checked in. For instance, you can run
haruhi run ssh i-598b673e
to ssh to a machine using an EC2 instance ID instead of an IP address.
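The idea behind such a command can be sketched as follows. This is not Haruhi's implementation: the `resolve_host` callback, the `ssh_argv` name, and the `hadoop` login user are all assumptions made for illustration, and a real resolver would have to query the cloud provider for the instance's address.

```python
def ssh_argv(instance_id, resolve_host, user="hadoop"):
    """Build an ssh argument list from an instance ID.

    resolve_host is a hypothetical callback that maps an instance ID to a
    host name or IP address.
    """
    if not instance_id.startswith("i-"):
        raise ValueError("expected an instance ID such as i-598b673e")
    host = resolve_host(instance_id)
    return ["ssh", "{}@{}".format(user, host)]


# Usage with a made-up lookup table standing in for a real API call.
known_hosts = {"i-598b673e": "203.0.113.10"}
argv = ssh_argv("i-598b673e", known_hosts.__getitem__)
```

Keeping the lookup behind a callback means the argument-building logic can be exercised without network access.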
Job cost accounting
The major feature in this release is a job cost accounting function.
Support for Hadoop Job-Level Accounting
Haruhi now writes a tag with the Hadoop job id to all line items for the job so we can add up line items with this tag to calculate the cost of a job after the fact. When running a flow (multiple jobs), Haruhi now uses the command line arguments of the flow to determine the name of the flow.
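Adding up tagged line items might look like the sketch below. The record layout (a `tags` dictionary and a `cost` string) and the tag key `hadoop-job-id` are assumptions made for this example, not the format Haruhi or the billing report actually uses.

```python
from collections import defaultdict
from decimal import Decimal


def cost_by_job(line_items, tag_key="hadoop-job-id"):
    """Sum the cost of all line items that carry a job id tag.

    Each line item is assumed to be a dict with a "cost" string and an
    optional "tags" dict; items without the tag are ignored.
    """
    totals = defaultdict(Decimal)  # Decimal() starts each total at zero
    for item in line_items:
        job_id = item.get("tags", {}).get(tag_key)
        if job_id is not None:
            totals[job_id] += Decimal(item["cost"])
    return dict(totals)


# Made-up line items for illustration only.
items = [
    {"cost": "0.50", "tags": {"hadoop-job-id": "job_1"}},
    {"cost": "0.25", "tags": {"hadoop-job-id": "job_1"}},
    {"cost": "1.00", "tags": {"hadoop-job-id": "job_2"}},
    {"cost": "9.99"},  # untagged, so not attributed to any job
]
totals = cost_by_job(items)
```

`Decimal` is used rather than floating point so that summed currency amounts do not pick up rounding noise.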
smushObject tool and weekly flow optimization
Tuning job parameters has sped up the weekly flow from 2.5 hours to about 57 minutes, with a small cost reduction. A job to smush objects has been created, so it is now possible to import DBpedia PageLinks into the :BaseKB space.
sumRDF
smushSubject tool
smushSubject uses a reduce-side join to change the vocabulary used in the subject field.
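A reduce-side join of this kind can be simulated in a few lines. The sketch below is an in-memory stand-in for the map, shuffle, and reduce phases rather than the actual Hadoop job. The function names, the tagging scheme, and the choice to pass unmapped subjects through unchanged are all assumptions.

```python
from collections import defaultdict

MAPPING, TRIPLE = 0, 1  # record tags; mappings are ordered before triples


def map_mapping(old_subject, new_subject):
    """Map phase for the vocabulary input: key by the old subject."""
    return old_subject, (MAPPING, new_subject)


def map_triple(subject, predicate, obj):
    """Map phase for the triple input: key by the subject to be rewritten."""
    return subject, (TRIPLE, (predicate, obj))


def reduce_subject(key, values):
    """Reduce phase: all records sharing a key meet here and are joined."""
    new_subject = key  # assumption: subjects without a mapping pass through
    out = []
    for tag, payload in sorted(values, key=lambda v: v[0]):
        if tag == MAPPING:
            new_subject = payload
        else:
            predicate, obj = payload
            out.append((new_subject, predicate, obj))
    return out


def smush_subjects(mappings, triples):
    """Simulate the shuffle by grouping mapper output on the join key."""
    groups = defaultdict(list)
    emitted = [map_mapping(*m) for m in mappings] + [map_triple(*t) for t in triples]
    for key, value in emitted:
        groups[key].append(value)
    result = []
    for key in sorted(groups):
        result.extend(reduce_subject(key, groups[key]))
    return result


# Made-up identifiers for illustration only.
rewritten = smush_subjects(
    [("dbpedia:Paris", "ns:m.0abc")],
    [("dbpedia:Paris", "p", "o1"), ("dbpedia:Unknown", "p", "o2")],
)
```

The join happens on the reduce side because neither input has to fit in memory on a mapper: both are keyed by the old subject, and the framework's shuffle brings matching records together.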
backport SelfAwareTool from telepath project
This release moves the "SelfAwareTool" from the telepath project into infovore; this component automatically configures a Hadoop job based on introspection of the environment of the Tool object.