Look for invalid ?o in Freebase data #70
Referenced commits:
- …objects from triples so we can extract the set of all distinct objects
- …cript to extract all unique object links
- …bject URIs (filter out if not rdf.basekb.com) and refactor UniqTool out to simplify development of various uniq tools
- …perators for subject, object and predicate ready for integration testing
For a "definition of done": construct a flow that performs all of the steps required for this analysis (extract the unique subjects, objects, and predicates, then run the set differences).
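The set-difference step can be sketched in plain Python; this is a minimal in-memory illustration of the idea (the real flow runs as Hadoop jobs over the :BaseKB dumps, and the N-Triples parsing here is deliberately naive):

```python
def parse_ntriple(line):
    # Naive N-Triples split: subject, predicate, object, trailing " ."
    s, p, o = line.strip().rstrip(" .").split(" ", 2)
    return s, p, o

def dangling_objects(lines):
    subjects, uri_objects = set(), set()
    for line in lines:
        s, p, o = parse_ntriple(line)
        subjects.add(s)
        if o.startswith("<"):          # only URI objects can "dangle"; literals never do
            uri_objects.add(o)
    return uri_objects - subjects      # objects that never appear as a subject

triples = [
    "<http://rdf.basekb.com/ns/m.1> <http://rdf.basekb.com/ns/type> <http://rdf.basekb.com/ns/m.2> .",
    "<http://rdf.basekb.com/ns/m.2> <http://rdf.basekb.com/ns/type> <http://rdf.basekb.com/ns/m.3> .",
]
print(dangling_objects(triples))   # → {'<http://rdf.basekb.com/ns/m.3>'}
```

In the actual flow each of the three extractions is its own uniq job and the difference is a separate stage, but the set algebra is the same.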
Here is the command I am running to do the diff:

```
haruhi run job -clusterId smallAwsCluster
```
If we wind up with a bunch of objects hanging in space, it will be harder to characterize them because, unlike the predicates, these will be twisty little mids that all look alike. One way to characterize them would be a join: given the set of dangling ?o values, retrieve all triples (?s ?p ?o1) where ?o1 = ?o. For this join, ?o is the key, and the tag is 1 if the record is one of the dangling ?o's and 2 if it is a triple. The "value" of a ?o record is immaterial, but the value of a triple record is just a Text representation of the triple. If a row tagged 1 comes up first, then we write the 2's to the output; otherwise we write the 1's. Once we get the actual offending triples, it ought to be obvious what we are dealing with.
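A minimal simulation of that tagged join, assuming a secondary sort delivers the tag-1 marker first within each key group (the function and record names are illustrative, not the project's actual Hadoop classes):

```python
from itertools import groupby

def tagged_join(dangling, triples):
    """Emit the triples whose object is in the dangling set.
    Records are (key, tag, value); sorting by (key, tag) puts the
    tag-1 marker first in each group, mimicking a secondary sort."""
    records = [(o, 1, None) for o in dangling]
    records += [(t[2], 2, t) for t in triples]
    records.sort(key=lambda r: (r[0], r[1]))
    out = []
    for key, group in groupby(records, key=lambda r: r[0]):
        group = list(group)
        if group[0][1] == 1:               # marker present: this key is a dangling ?o
            out.extend(v for _, tag, v in group if tag == 2)
    return out

triples = [("m.1", "type", "m.3"), ("m.2", "name", "m.4")]
print(tagged_join({"m.3"}, triples))       # → [('m.1', 'type', 'm.3')]
```

On a real cluster the sort and grouping are done by the shuffle, with the tag carried in a composite key so the reducer sees the marker before any triples.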
OK, we get 2,136,121 loose predicates and almost all of them are mids. Here are the 30 other predicates that show up:
To do tomorrow: write tests for MatchingKeyReducer and FetchTriplesWithMatchingObjectsMapper.
Here is the command that is supposed to match up the missing objects with the triples responsible for them:
Note: when I ran the command above, I found that none of the arguments after /sieved/text were parsed properly because there was a space after the backslash! Shades of Python in a shell script, yikes.
I fixed that problem, then ran it with a mediumAwsCluster, but I forgot to bump -r 4 up and started running out of heap. Possibly I could have done better with more reducers, but there are many segments up there (description, name) that contain only literals, and these could be removed from the input. To speed up the debug cycle I'm going to make a tiny synthetic test case against just one file, labels-m-00001.gz.
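Dropping literal-object triples before the join would shrink the input considerably; a sketch of such a pre-filter (again a naive N-Triples check, not the project's actual filtering code):

```python
def keep_uri_object(line):
    """Keep a triple only if its object is a URI (starts with '<'),
    since literal objects can never be dangling."""
    parts = line.strip().rstrip(" .").split(" ", 2)
    return len(parts) == 3 and parts[2].startswith("<")

lines = [
    '<s1> <name> "Some label"@en .',
    "<s1> <type> <o1> .",
]
print([l for l in lines if keep_uri_object(l)])   # → ['<s1> <type> <o1> .']
```

In the MapReduce setting this is a map-side filter, so it costs one pass but saves shuffle volume and reducer heap on every later stage.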
OK, actually I ran it against a-m-00000.gz because we're really interested in URI objects. When I grepped both the output triples and the input triples, I got exactly the same counts for type.property and measurement_unit.dated_percentage. So it looks like the algorithm is good, but somehow we're creating a GIGO situation. Here is the new command:
At this point there is still a problem: the job above gives empty output. On the other hand, in the missingObjects2 file I see the following node
and in links/links-m-00017.nt.gz I see the triple
so somehow this fact is getting lost.
We have some cases, such as the notable type, where there are ?o mids that are not represented as mids in the ?s field.
A key thing is that we want a list of the unique ?p ?o pairs that are affected, because my understanding of the Freebase dump creation process is that errors are often idiosyncratic to particular predicates. A reasonable approach is to extract the dangling ?o set, join it back against the triples, and then reduce the matching triples down to the distinct (?p, ?o) pairs.
We could do it in fewer stages, but I think the memory consumption would be higher and it would be less scalable.
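Combining the stages, here is a toy end-to-end version of that approach (in-memory Python, purely illustrative of the multi-stage flow; the mids and predicate names are made up):

```python
def unique_dangling_po_pairs(triples):
    """Stage 1: collect subject and object sets; stage 2: diff them to
    find dangling objects; stage 3: reduce the triples that point at a
    dangling object down to distinct (predicate, object) pairs."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    dangling = objects - subjects
    return sorted({(p, o) for _, p, o in triples if o in dangling})

triples = [
    ("m.1", "ns:type.object.type", "m.9"),
    ("m.2", "ns:type.object.type", "m.9"),
    ("m.1", "ns:common.topic.notable_for", "m.7"),
    ("m.9", "ns:type.object.name", "m.1"),
]
print(unique_dangling_po_pairs(triples))   # → [('ns:common.topic.notable_for', 'm.7')]
```

Each set comprehension here stands in for one MapReduce stage; fusing them would mean holding more state in a single reducer, which is the memory/scalability trade-off mentioned above.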