Tool to extract objects with specified types #92

Closed
paulhoule opened this issue Dec 14, 2013 · 6 comments

@paulhoule

This is necessary for skibase. A big part of this will be configuring the job so that topics are grouped and sorted on ?s.

A simple and scalable strategy is for the reducer to run in two passes: the first checks whether the type condition is met for a subject, and the second sends that subject's facts to the output if it is.
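
A minimal sketch of that two-pass idea, assuming a Hadoop reducer keyed on the subject that receives all of that subject's triples as values (the class name, configuration key, and helper below are hypothetical, not the actual bakemono code). Because Hadoop only lets you iterate the values once, the "first pass" buffers copies of the triples while checking the condition, and the "second pass" replays the buffer to the output:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch of the two-pass strategy; keyed on ?s, values are that subject's triples.
public class TypeFilterReducer extends Reducer<Text, Text, Text, Text> {
    private Set<String> acceptedTypes;

    @Override
    protected void setup(Context context) {
        // "extractIsA.types" is an assumed configuration key for the expanded -type URIs.
        acceptedTypes = new HashSet<String>(Arrays.asList(
            context.getConfiguration().getStrings("extractIsA.types", new String[0])));
    }

    @Override
    protected void reduce(Text subject, Iterable<Text> triples, Context context)
            throws IOException, InterruptedException {
        // Pass 1: buffer the triples (copying out of the reused Writable) and check
        // whether any of them is an rdf:type fact naming one of the accepted types.
        List<String> buffered = new ArrayList<String>();
        boolean matches = false;
        for (Text t : triples) {
            String triple = t.toString();
            buffered.add(triple);
            if (isAcceptedTypeFact(triple)) {
                matches = true;
            }
        }
        // Pass 2: only if the condition was met, send all of the subject's facts to the output.
        if (matches) {
            for (String triple : buffered) {
                context.write(subject, new Text(triple));
            }
        }
    }

    private boolean isAcceptedTypeFact(String triple) {
        if (!triple.contains("<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>")) {
            return false;
        }
        for (String type : acceptedTypes) {
            if (triple.contains("<" + type + ">")) {
                return true;
            }
        }
        return false;
    }
}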

paulhoule added a commit that referenced this issue Dec 14, 2013
paulhoule added a commit that referenced this issue Dec 14, 2013
paulhoule added a commit that referenced this issue Dec 14, 2013
paulhoule added a commit that referenced this issue Dec 14, 2013
paulhoule added a commit that referenced this issue Dec 15, 2013
paulhoule added a commit that referenced this issue Dec 15, 2013
paulhoule added a commit that referenced this issue Dec 17, 2013
…rking on this assumption. This fix passes the unit tests and I think it will work on the cluster
paulhoule added a commit that referenced this issue Dec 17, 2013
paulhoule added a commit that referenced this issue Dec 17, 2013
paulhoule added a commit that referenced this issue Dec 17, 2013
paulhoule added a commit that referenced this issue Dec 17, 2013
…omething uncontrolled in environment in last run?
paulhoule added a commit that referenced this issue Dec 17, 2013
…; hopefully the logging code isn't making this work...
@paulhoule

I have had some success running the following command

haruhi run job -clusterId tinyAwsCluster extractIsA -dir s3n://basekb-now/2013-12-08-00-00/sieved \
 -input a/a-m-00000.nt.gz \
 -prefix http://rdf.basekb.com/ns/ \
 -type skiing.ski_area \
 -output s3n://basekb-sandbox/only-ski-tiny

and even over all of the a files. However, I get no results when I run

haruhi run job -clusterId smallAwsCluster extractIsA \
 -dir s3n://basekb-now/2013-12-08-00-00/sieved \
 -R 6 \
 -input a,description,dotdot,key,keyNs,label,links,literals,name,notability,notableForPredicate,text,webpages \
 -prefix http://rdf.basekb.com/ns/skiing. \
 -type ski_area,ski_run,ski_lift,ski_area_ownership,yearly_snowfall,ski_area_owner \
 -type lift_tenure,lift_type,ski_lift_manufacturer,run_rating,run_rating_symbol,ski_lodge \
 -output s3n://basekb-sandbox/only-ski

Now, I did take out the logging code, but it really ought to work without it. (Although lately I've had a lot of cases where things started working when I added logging code...)

@paulhoule

ooooh.... looks like reducer 0 failed with the following message...

2013-12-17 21:02:11,155 FATAL org.apache.hadoop.mapred.Task (MapOutputCopier attempt_201312171944_0001_r_000005_0.1): attempt_201312171944_0001_r_000005_0 : Map output copy failure : java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1711)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1571)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1412)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1344)

2013-12-17 21:02:11,197 INFO org.apache.hadoop.mapred.ReduceTask (MapOutputCopier attempt_201312171944_0001_r_000005_0.18): Read 8737189 bytes from map-output for attempt_201312171944_0001_m_000197_0
2013-12-17 21:02:11,197 INFO org.apache.hadoop.mapred.ReduceTaskStatus (main): addFetchOkMap attempt_201312171944_0001_m_000197_0
2013-12-17 21:02:11,198 INFO org.apache.hadoop.mapred.ReduceTask (main): attempt_201312171944_0001_r_000005_0 Scheduled 1 outputs (1 slow hosts and0 dup hosts)
2013-12-17 21:02:11,231 FATAL org.apache.hadoop.mapred.Task (Thread for polling Map Completion Events): attempt_201312171944_0001_r_000005_0 GetMapEventsThread Ignoring exception : org.apache.hadoop.ipc.RemoteException: java.io.IOException: JvmValidate Failed. Ignoring request from task: attempt_201312171944_0001_r_000005_0, with JvmId: jvm_201312171944_0001_r_285086879
    at org.apache.hadoop.mapred.TaskTracker.validateJVM(TaskTracker.java:3394)
    at org.apache.hadoop.mapred.TaskTracker.getMapCompletionEvents(TaskTracker.java:3653)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)

    at org.apache.hadoop.ipc.Client.call(Client.java:1067)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at com.sun.proxy.$Proxy1.getMapCompletionEvents(Unknown Source)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2907)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2865)

2013-12-17 21:02:11,262 FATAL org.apache.hadoop.mapred.Task (Thread for polling Map Completion Events): Failed to contact the tasktracker
org.apache.hadoop.ipc.RemoteException: java.io.IOException: JvmValidate Failed. Ignoring request from task: attempt_201312171944_0001_r_000005_0, with JvmId: jvm_201312171944_0001_r_285086879
    at org.apache.hadoop.mapred.TaskTracker.validateJVM(TaskTracker.java:3394)
    at org.apache.hadoop.mapred.TaskTracker.fatalError(TaskTracker.java:3636)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)

Note that our nodes are highly memory-constrained, so I can believe we are running out of heap. It's disturbing that the system seems to think that this task succeeded.
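
If the copy phase itself is what is blowing the heap, one common mitigation on memory-constrained nodes is to shrink the shuffle's in-memory buffer and cut the number of parallel copiers. A sketch of that tuning follows; the property names are the classic Hadoop 1.x / MRv1 ones, but the values are illustrative guesses, not something tuned for this cluster:

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    // Illustrative values only; the Hadoop 1.x defaults are 0.70 and 5 respectively.
    static void tuneForSmallHeap(Configuration conf) {
        // Fraction of the reduce task's heap used to hold map outputs during the copy phase.
        conf.set("mapred.job.shuffle.input.buffer.percent", "0.40");
        // Number of parallel MapOutputCopier threads per reduce task.
        conf.setInt("mapred.reduce.parallel.copies", 2);
        // Per-task child JVM heap, if the nodes can spare it.
        conf.set("mapred.child.java.opts", "-Xmx512m");
    }
}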

@paulhoule

Note that two processes failed, and the other processes wound up like this:

2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Initializing type list
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/run_rating_symbol>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/lift_type>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_area_owner>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_area_ownership>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_run>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/run_rating>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_lift_manufacturer>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/yearly_snowfall>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_area>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_lodge>]
2013-12-17 21:29:35,983 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/lift_tenure>]
2013-12-17 21:29:35,983 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_lift>]

Probably our prefix processing is too smart and doesn't just prepend the prefix, so we need to put skiing. in all the types unless we change it to be dumber.
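
For reference, the "dumber" behaviour being asked for is nothing more than string concatenation, roughly like this (hypothetical helper, not the actual option-parsing code):

class PrefixExpansion {
    // Hypothetical illustration: each -type value just gets the -prefix glued on the
    // front, nothing cleverer.
    static String expandType(String prefix, String type) {
        // expandType("http://rdf.basekb.com/ns/skiing.", "ski_area")
        //   -> "http://rdf.basekb.com/ns/skiing.ski_area"
        return prefix + type;
    }
}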

@paulhoule

Here is what I am running now

haruhi run job -clusterId m1LargeX2AwsCluster extractIsA \
 -dir s3n://basekb-now/2013-12-08-00-00/sieved \
 -R 6 \
 -input a,description,dotdot,key,keyNs,label,links,literals,name,notability,notableForPredicate,text,webpages \
 -prefix http://rdf.basekb.com/ns/ \
 -type skiing.ski_area,skiing.ski_run,skiing.ski_lift,skiing.ski_area_ownership,skiing.yearly_snowfall,skiing.ski_area_owner \
 -type skiing.lift_tenure,skiing.lift_type,skiing.ski_lift_manufacturer,skiing.run_rating,skiing.run_rating_symbol,skiing.ski_lodge \
 -output s3n://basekb-sandbox/only-ski

I get a total of 142 facts out of this, so it's obvious that things are still terribly out of whack. What's going on?

paulhoule added a commit that referenced this issue Dec 18, 2013
…rables because the one for sets seemed to cause trouble in the past
@paulhoule

The result from this is screwier than I expected.

With the above command I am getting a total of 95 facts in the output, which is fewer than the number of ski areas we should be turning up. Stranger still, I find duplicates, which is completely unexpected:

<http://rdf.basekb.com/ns/m.04kk2lz>    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>    <http://rdf.basekb.com/ns/skiing.lift_tenure>   .
<http://rdf.basekb.com/ns/m.04kk2lz>    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>       <http://rdf.basekb.com/ns/skiing.lift_tenure>   .
<http://rdf.basekb.com/ns/m.04kk2lz>    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>       <http://rdf.basekb.com/ns/skiing.lift_tenure>   .
<http://rdf.basekb.com/ns/m.04kk2lz>    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>       <http://rdf.basekb.com/ns/skiing.lift_tenure>   .
<http://rdf.basekb.com/ns/m.04kk2lz>    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>       <http://rdf.basekb.com/ns/skiing.lift_tenure>   .

I'm planning on running the following test case

haruhi run job -clusterId m1LargeX2AwsCluster extractIsA \
 -dir s3n://basekb-now/2013-12-08-00-00/sieved \
 -R 6 \
 -input a,label \
 -prefix http://rdf.basekb.com/ns/ \
 -type skiing.ski_area \
 -output s3n://basekb-sandbox/only-ski

once I've checked out a few possible ways this could happen. (Verify no dups in the raw data, look to see that we're not adding the same input path over and over again, etc.)
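
For the second of those checks, one way to rule out a double-added input path would be to dedupe the paths before handing them to the job; a sketch, assuming the new-API FileInputFormat (the surrounding class and method names are hypothetical):

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical guard: collect the -input paths into a set so a path that is
// accidentally listed twice is only added to the job once.
class InputPathGuard {
    static void addInputsOnce(Job job, Iterable<Path> requested) throws IOException {
        Set<Path> unique = new LinkedHashSet<Path>();
        for (Path p : requested) {
            if (!unique.add(p)) {
                System.err.println("Duplicate input path ignored: " + p);
            }
        }
        for (Path p : unique) {
            FileInputFormat.addInputPath(job, p);
        }
    }
}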

@paulhoule

If we run that test case, we get 475 facts, which is close to the number of ski areas. All of those facts are 'a' facts, and there are still lots of dups...

paulhoule added a commit that referenced this issue Dec 18, 2013
paulhoule added a commit that referenced this issue Dec 18, 2013
…re now we're getting hung up because Hadoop is reusing Writables so they are not safe to store in a collection
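
The fix that commit message points at is the standard one for this Hadoop gotcha, sketched below: copy each value out of the reused Writable before it goes into a collection (new Text(value) or WritableUtils.clone both work). Buffering N references to the same reused instance leaves you with N copies of the last value seen, which would produce exactly the kind of duplicate facts shown above.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;

// Hadoop reuses the same Writable instance for every value passed to reduce(), so
// storing the instance itself in a collection is unsafe. Sketch of the safe
// buffering pattern:
class WritableBufferingSketch {
    static List<Text> bufferValues(Iterable<Text> values) {
        List<Text> buffered = new ArrayList<Text>();
        for (Text value : values) {
            buffered.add(new Text(value));  // copy; never buffered.add(value) directly
        }
        return buffered;
    }
}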