Tool to extract objects with specified types #92

Closed
paulhoule opened this issue Dec 14, 2013 · 6 comments

@paulhoule

This is necessary for skibase. A big part of this will be configuring the job so that topics are grouped and sorted on ?s.

A simple and scalable strategy is for the reducer to run in two passes: the first checks whether the type condition is met for a subject, and the second sends that subject's facts to the output if it is.
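
A minimal sketch of that two-pass idea, assuming a Hadoop reducer keyed on the subject that receives all of that subject's triples as values (the class name, configuration key, and helper below are hypothetical, not the actual bakemono code). Because Hadoop only lets you iterate the values once, the "first pass" buffers copies of the triples while checking the condition, and the "second pass" replays the buffer to the output:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch of the two-pass strategy; keyed on ?s, values are that subject's triples.
public class TypeFilterReducer extends Reducer<Text, Text, Text, Text> {
    private Set<String> acceptedTypes;

    @Override
    protected void setup(Context context) {
        // "extractIsA.types" is an assumed configuration key for the expanded -type URIs.
        acceptedTypes = new HashSet<String>(Arrays.asList(
            context.getConfiguration().getStrings("extractIsA.types", new String[0])));
    }

    @Override
    protected void reduce(Text subject, Iterable<Text> triples, Context context)
            throws IOException, InterruptedException {
        // Pass 1: buffer the triples (copying out of the reused Writable) and check
        // whether any of them is an rdf:type fact naming one of the accepted types.
        List<String> buffered = new ArrayList<String>();
        boolean matches = false;
        for (Text t : triples) {
            String triple = t.toString();
            buffered.add(triple);
            if (isAcceptedTypeFact(triple)) {
                matches = true;
            }
        }
        // Pass 2: only if the condition was met, send all of the subject's facts to the output.
        if (matches) {
            for (String triple : buffered) {
                context.write(subject, new Text(triple));
            }
        }
    }

    private boolean isAcceptedTypeFact(String triple) {
        if (!triple.contains("<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>")) {
            return false;
        }
        for (String type : acceptedTypes) {
            if (triple.contains("<" + type + ">")) {
                return true;
            }
        }
        return false;
    }
}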

paulhoule added a commit that referenced this issue Dec 14, 2013
paulhoule added a commit that referenced this issue Dec 14, 2013
paulhoule added a commit that referenced this issue Dec 14, 2013
paulhoule added a commit that referenced this issue Dec 14, 2013
paulhoule added a commit that referenced this issue Dec 15, 2013
paulhoule added a commit that referenced this issue Dec 15, 2013
paulhoule added a commit that referenced this issue Dec 17, 2013
…rking on this assumption. This fix passes the unit tests and I think it will work on the cluster
paulhoule added a commit that referenced this issue Dec 17, 2013
paulhoule added a commit that referenced this issue Dec 17, 2013
paulhoule added a commit that referenced this issue Dec 17, 2013
paulhoule added a commit that referenced this issue Dec 17, 2013
…omething uncontrolled in environment in last run?
paulhoule added a commit that referenced this issue Dec 17, 2013
…; hopefully the logging code isn't making this work...
@paulhoule

I have had some success running the following command

haruhi run job -clusterId tinyAwsCluster extractIsA -dir s3n://basekb-now/2013-12-08-00-00/sieved \
 -input a/a-m-00000.nt.gz \
 -prefix http://rdf.basekb.com/ns/ \
 -type skiing.ski_area \
 -output s3n://basekb-sandbox/only-ski-tiny

and even over all of the a files. However, I get no results when I run

haruhi run job -clusterId smallAwsCluster extractIsA \
 -dir s3n://basekb-now/2013-12-08-00-00/sieved \
 -R 6 \
 -input a,description,dotdot,key,keyNs,label,links,literals,name,notability,notableForPredicate,text,webpages \
 -prefix http://rdf.basekb.com/ns/skiing. \
 -type ski_area,ski_run,ski_lift,ski_area_ownership,yearly_snowfall,ski_area_owner \
 -type lift_tenure,lift_type,ski_lift_manufacturer,run_rating,run_rating_symbol,ski_lodge \
 -output s3n://basekb-sandbox/only-ski

Now, I did take out the logging code, but it really ought to work without it. (Although lately I've had a lot of cases where things started working when I added logging code...)

@paulhoule

ooooh.... looks like reducer 0 failed with the following message...

2013-12-17 21:02:11,155 FATAL org.apache.hadoop.mapred.Task (MapOutputCopier attempt_201312171944_0001_r_000005_0.1): attempt_201312171944_0001_r_000005_0 : Map output copy failure : java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1711)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1571)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1412)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1344)

2013-12-17 21:02:11,197 INFO org.apache.hadoop.mapred.ReduceTask (MapOutputCopier attempt_201312171944_0001_r_000005_0.18): Read 8737189 bytes from map-output for attempt_201312171944_0001_m_000197_0
2013-12-17 21:02:11,197 INFO org.apache.hadoop.mapred.ReduceTaskStatus (main): addFetchOkMap attempt_201312171944_0001_m_000197_0
2013-12-17 21:02:11,198 INFO org.apache.hadoop.mapred.ReduceTask (main): attempt_201312171944_0001_r_000005_0 Scheduled 1 outputs (1 slow hosts and0 dup hosts)
2013-12-17 21:02:11,231 FATAL org.apache.hadoop.mapred.Task (Thread for polling Map Completion Events): attempt_201312171944_0001_r_000005_0 GetMapEventsThread Ignoring exception : org.apache.hadoop.ipc.RemoteException: java.io.IOException: JvmValidate Failed. Ignoring request from task: attempt_201312171944_0001_r_000005_0, with JvmId: jvm_201312171944_0001_r_285086879
    at org.apache.hadoop.mapred.TaskTracker.validateJVM(TaskTracker.java:3394)
    at org.apache.hadoop.mapred.TaskTracker.getMapCompletionEvents(TaskTracker.java:3653)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)

    at org.apache.hadoop.ipc.Client.call(Client.java:1067)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at com.sun.proxy.$Proxy1.getMapCompletionEvents(Unknown Source)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2907)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2865)

2013-12-17 21:02:11,262 FATAL org.apache.hadoop.mapred.Task (Thread for polling Map Completion Events): Failed to contact the tasktracker
org.apache.hadoop.ipc.RemoteException: java.io.IOException: JvmValidate Failed. Ignoring request from task: attempt_201312171944_0001_r_000005_0, with JvmId: jvm_201312171944_0001_r_285086879
    at org.apache.hadoop.mapred.TaskTracker.validateJVM(TaskTracker.java:3394)
    at org.apache.hadoop.mapred.TaskTracker.fatalError(TaskTracker.java:3636)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)

Note that our nodes are highly memory-constrained, so I can believe we are running out of heap. It's disturbing that the system seems to think that this task succeeded.
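
If the copy phase itself is what is blowing the heap, one common mitigation on memory-constrained nodes is to shrink the shuffle's in-memory buffer and cut the number of parallel copiers. A sketch of that tuning follows; the property names are the classic Hadoop 1.x / MRv1 ones, but the values are illustrative guesses, not something tuned for this cluster:

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    // Illustrative values only; the Hadoop 1.x defaults are 0.70 and 5 respectively.
    static void tuneForSmallHeap(Configuration conf) {
        // Fraction of the reduce task's heap used to hold map outputs during the copy phase.
        conf.set("mapred.job.shuffle.input.buffer.percent", "0.40");
        // Number of parallel MapOutputCopier threads per reduce task.
        conf.setInt("mapred.reduce.parallel.copies", 2);
        // Per-task child JVM heap, if the nodes can spare it.
        conf.set("mapred.child.java.opts", "-Xmx512m");
    }
}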

@paulhoule

Note that two processes failed, and the other processes wound up like this:

2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Initializing type list
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/run_rating_symbol>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/lift_type>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_area_owner>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_area_ownership>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_run>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/run_rating>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_lift_manufacturer>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/yearly_snowfall>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_area>]
2013-12-17 21:29:35,982 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_lodge>]
2013-12-17 21:29:35,983 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/lift_tenure>]
2013-12-17 21:29:35,983 INFO com.ontology2.bakemono.entityCentric.EntityIsAReducer (main): Accepting type: [<http://rdf.basekb.com/ns/ski_lift>]

Probably our prefix processing is too smart and doesn't just prepend the prefix, so we need to put skiing. in all the types unless we change it to be dumber.
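
For reference, the "dumber" behaviour being asked for is nothing more than string concatenation, roughly like this (hypothetical helper, not the actual option-parsing code):

class PrefixExpansion {
    // Hypothetical illustration: each -type value just gets the -prefix glued on the
    // front, nothing cleverer.
    static String expandType(String prefix, String type) {
        // expandType("http://rdf.basekb.com/ns/skiing.", "ski_area")
        //   -> "http://rdf.basekb.com/ns/skiing.ski_area"
        return prefix + type;
    }
}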

@paulhoule

Here is what I am running now

haruhi run job -clusterId m1LargeX2AwsCluster extractIsA \
 -dir s3n://basekb-now/2013-12-08-00-00/sieved \
 -R 6 \
 -input a,description,dotdot,key,keyNs,label,links,literals,name,notability,notableForPredicate,text,webpages \
 -prefix http://rdf.basekb.com/ns/ \
 -type skiing.ski_area,skiing.ski_run,skiing.ski_lift,skiing.ski_area_ownership,skiing.yearly_snowfall,skiing.ski_area_owner \
 -type skiing.lift_tenure,skiing.lift_type,skiing.ski_lift_manufacturer,skiing.run_rating,skiing.run_rating_symbol,skiing.ski_lodge \
 -output s3n://basekb-sandbox/only-ski

I get a total of 142 facts out of this, so it's obvious that things are still terribly out of whack. What's going on?

paulhoule added a commit that referenced this issue Dec 18, 2013
…rables because the one for sets seemed to cause trouble in the past
@paulhoule

The result from this is screwier than I expected.

With the above command I am getting a total of 95 facts in the output, which is fewer than the number of ski areas we should be turning up. Stranger still, I find duplicates, which is completely unexpected:

<http://rdf.basekb.com/ns/m.04kk2lz>    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>    <http://rdf.basekb.com/ns/skiing.lift_tenure>   .
<http://rdf.basekb.com/ns/m.04kk2lz>    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>       <http://rdf.basekb.com/ns/skiing.lift_tenure>   .
<http://rdf.basekb.com/ns/m.04kk2lz>    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>       <http://rdf.basekb.com/ns/skiing.lift_tenure>   .
<http://rdf.basekb.com/ns/m.04kk2lz>    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>       <http://rdf.basekb.com/ns/skiing.lift_tenure>   .
<http://rdf.basekb.com/ns/m.04kk2lz>    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>       <http://rdf.basekb.com/ns/skiing.lift_tenure>   .

I'm planning on running the following test case

haruhi run job -clusterId m1LargeX2AwsCluster extractIsA \
 -dir s3n://basekb-now/2013-12-08-00-00/sieved \
 -R 6 \
 -input a,label \
 -prefix http://rdf.basekb.com/ns/ \
 -type skiing.ski_area \
 -output s3n://basekb-sandbox/only-ski

once I've checked out a few possible ways this could happen. (Verify no dups in the raw data, look to see that we're not adding the same input path over and over again, etc.)
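
For the second of those checks, one way to rule out a double-added input path would be to dedupe the paths before handing them to the job; a sketch, assuming the new-API FileInputFormat (the surrounding class and method names are hypothetical):

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical guard: collect the -input paths into a set so a path that is
// accidentally listed twice is only added to the job once.
class InputPathGuard {
    static void addInputsOnce(Job job, Iterable<Path> requested) throws IOException {
        Set<Path> unique = new LinkedHashSet<Path>();
        for (Path p : requested) {
            if (!unique.add(p)) {
                System.err.println("Duplicate input path ignored: " + p);
            }
        }
        for (Path p : unique) {
            FileInputFormat.addInputPath(job, p);
        }
    }
}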

@paulhoule

If we run that test case, we get 475 facts, which is close to the number of ski areas. All of those facts are 'a' facts, and there are still lots of dups...

paulhoule added a commit that referenced this issue Dec 18, 2013
paulhoule added a commit that referenced this issue Dec 18, 2013
…re now we're getting hung up because Hadoop is reusing Writables so they are not safe to store in a collection
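
The fix that commit message points at is the standard one for this Hadoop gotcha, sketched below: copy each value out of the reused Writable before it goes into a collection (new Text(value) or WritableUtils.clone both work). Buffering N references to the same reused instance leaves you with N copies of the last value seen, which would produce exactly the kind of duplicate facts shown above.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;

// Hadoop reuses the same Writable instance for every value passed to reduce(), so
// storing the instance itself in a collection is unsafe. Sketch of the safe
// buffering pattern:
class WritableBufferingSketch {
    static List<Text> bufferValues(Iterable<Text> values) {
        List<Text> buffered = new ArrayList<Text>();
        for (Text value : values) {
            buffered.add(new Text(value));  // copy; never buffered.add(value) directly
        }
        return buffered;
    }
}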