Look for invalid ?o in Freebase data #70
Referenced commits:
- …objects from triples so we can extract the set of all distinct objects
- …cript to extract all unique object links
- …bject URIs (filter out if not rdf.basekb.com) and refactor UniqTool out to simplify development of various uniq tools
- …perators for subject, object and predicate ready for integration testing
For a "definition of done": construct a flow that performs all of the steps required for this analysis (extract the unique subjects, objects, and predicates, then run the set differences).
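The set-difference step can be sketched in plain Python; this is a minimal in-memory illustration of the idea (the real flow runs as Hadoop jobs over the :BaseKB dumps, and the N-Triples parsing here is deliberately naive):

```python
def parse_ntriple(line):
    # Naive N-Triples split: subject, predicate, object, trailing " ."
    s, p, o = line.strip().rstrip(" .").split(" ", 2)
    return s, p, o

def dangling_objects(lines):
    subjects, uri_objects = set(), set()
    for line in lines:
        s, p, o = parse_ntriple(line)
        subjects.add(s)
        if o.startswith("<"):          # only URI objects can "dangle"; literals never do
            uri_objects.add(o)
    return uri_objects - subjects      # objects that never appear as a subject

triples = [
    "<http://rdf.basekb.com/ns/m.1> <http://rdf.basekb.com/ns/type> <http://rdf.basekb.com/ns/m.2> .",
    "<http://rdf.basekb.com/ns/m.2> <http://rdf.basekb.com/ns/type> <http://rdf.basekb.com/ns/m.3> .",
]
print(dangling_objects(triples))   # → {'<http://rdf.basekb.com/ns/m.3>'}
```

In the actual flow each of the three extractions is its own uniq job and the difference is a separate stage, but the set algebra is the same.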
Here is the command I am running to do the diff:

```
haruhi run job -clusterId smallAwsCluster
```
If we wind up with a bunch of objects hanging in space, it will be harder to characterize them because, unlike the predicates, these will be twisty little mids that all look alike. One way to characterize them would be a join: given the set of dangling ?o values, retrieve all triples (?s ?p ?o1) where ?o1 = ?o. For this join, ?o is the key, and the tag is 1 if the record is one of the dangling ?o's and 2 if it is a triple. The "value" of a ?o record is immaterial, but the value of a triple record is just a Text representation of the triple. If a row tagged 1 comes up first, then we write the 2's to the output; otherwise we write the 1's. Once we get the actual offending triples, it ought to be obvious what we are dealing with.
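A minimal simulation of that tagged join, assuming a secondary sort delivers the tag-1 marker first within each key group (the function and record names are illustrative, not the project's actual Hadoop classes):

```python
from itertools import groupby

def tagged_join(dangling, triples):
    """Emit the triples whose object is in the dangling set.
    Records are (key, tag, value); sorting by (key, tag) puts the
    tag-1 marker first in each group, mimicking a secondary sort."""
    records = [(o, 1, None) for o in dangling]
    records += [(t[2], 2, t) for t in triples]
    records.sort(key=lambda r: (r[0], r[1]))
    out = []
    for key, group in groupby(records, key=lambda r: r[0]):
        group = list(group)
        if group[0][1] == 1:               # marker present: this key is a dangling ?o
            out.extend(v for _, tag, v in group if tag == 2)
    return out

triples = [("m.1", "type", "m.3"), ("m.2", "name", "m.4")]
print(tagged_join({"m.3"}, triples))       # → [('m.1', 'type', 'm.3')]
```

On a real cluster the sort and grouping are done by the shuffle, with the tag carried in a composite key so the reducer sees the marker before any triples.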
OK, we get 2,136,121 loose predicates and almost all of them are mids. Here are the 30 other predicates that show up:
To do tomorrow: write tests for MatchingKeyReducer and FetchTriplesWithMatchingObjectsMapper.
Here is the command that is supposed to match up the missing objects with the triples responsible for them:
Note: when I ran the command above, I found that none of the arguments after /sieved/text were parsed properly because there was a space after the backslash! Shades of Python in a shell script, yikes.
I fixed that problem, then ran it with a mediumAwsCluster, but I forgot to bump -r 4 up and started running out of heap. Possibly I could have done better with more reducers, but there are many segments up there (description, name) that contain only literals, and these could be removed from the input. To speed up the debug cycle I'm going to make a tiny synthetic test case against just one file, labels-m-00001.gz.
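Dropping literal-object triples before the join would shrink the input considerably; a sketch of such a pre-filter (again a naive N-Triples check, not the project's actual filtering code):

```python
def keep_uri_object(line):
    """Keep a triple only if its object is a URI (starts with '<'),
    since literal objects can never be dangling."""
    parts = line.strip().rstrip(" .").split(" ", 2)
    return len(parts) == 3 and parts[2].startswith("<")

lines = [
    '<s1> <name> "Some label"@en .',
    "<s1> <type> <o1> .",
]
print([l for l in lines if keep_uri_object(l)])   # → ['<s1> <type> <o1> .']
```

In the MapReduce setting this is a map-side filter, so it costs one pass but saves shuffle volume and reducer heap on every later stage.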
OK, actually I ran it against a-m-00000.gz because we're really interested in URI objects. When I grepped both the output triples and the input triples, I got exactly the same counts for type.property and measurement_unit.dated_percentage. So it looks like the algorithm is good, but somehow we're creating a GIGO situation. Here is the new command:
At this point there is still a problem: the job above gives empty output. On the other hand, in the missingObjects2 file I see the following node
and in links/links-m-00017.nt.gz I see the triple
so somehow this fact is getting lost.
We have some cases, such as the notable type, where there are ?o mids that are not represented as mids in the ?s field.
A key thing is that we want a list of the unique ?p ?o pairs that are affected, because my understanding of the Freebase dump creation process is that errors are often idiosyncratic to particular predicates. A reasonable approach is to extract the dangling ?o set, join it back against the triples, and then reduce the matching triples down to the distinct (?p, ?o) pairs.
We could do it in fewer stages, but I think the memory consumption would be higher and it would be less scalable.
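Combining the stages, here is a toy end-to-end version of that approach (in-memory Python, purely illustrative of the multi-stage flow; the mids and predicate names are made up):

```python
def unique_dangling_po_pairs(triples):
    """Stage 1: collect subject and object sets; stage 2: diff them to
    find dangling objects; stage 3: reduce the triples that point at a
    dangling object down to distinct (predicate, object) pairs."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    dangling = objects - subjects
    return sorted({(p, o) for _, p, o in triples if o in dangling})

triples = [
    ("m.1", "ns:type.object.type", "m.9"),
    ("m.2", "ns:type.object.type", "m.9"),
    ("m.1", "ns:common.topic.notable_for", "m.7"),
    ("m.9", "ns:type.object.name", "m.1"),
]
print(unique_dangling_po_pairs(triples))   # → [('ns:common.topic.notable_for', 'm.7')]
```

Each set comprehension here stands in for one MapReduce stage; fusing them would mean holding more state in a single reducer, which is the memory/scalability trade-off mentioned above.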