Unexpected metadata for non-existent ebooks #57

hugovk · 2016-10-31T07:14:30Z

The metadata database contains records for non-existent ebooks.

Exists: https://www.gutenberg.org/ebooks/1
Doesn't exist: https://www.gutenberg.org/ebooks/182

Example code:

from __future__ import print_function
from gutenberg.query import get_metadata

def get_all_metadata(etextno):
    for feature_name in ['author', 'formaturi', 'language',
                         'rights', 'subject', 'title']:
        print("{}\t{}\t{}".format(
            etextno,
            feature_name,
            get_metadata(feature_name, etextno)))
    print()

get_all_metadata(1)  # US Declaration of Independence
get_all_metadata(182)  # no such ebook

Actual output:

1       author  frozenset([u'United States President (1801-1809)'])
1       formaturi       frozenset([u'http://www.gutenberg.org/ebooks/1.txt.utf-8', u'http://www.gutenberg.org/ebooks/1.epub.noimages', u'http://www.gutenberg.org/6/5/2/6527/6527-t/6527-t.tex', u'http://www.gutenberg.org/ebooks/1.html.noimages', u'http://www.gutenberg.org/files/1/1.zip', u'http://www.gutenberg.org/ebooks/1.epub.images', u'http://www.gutenberg.org/ebooks/1.rdf', u'http://www.gutenberg.org/ebooks/1.kindle.noimages', u'http://www.gutenberg.org/files/1/1.txt', u'http://www.gutenberg.org/ebooks/1.html.images', u'http://www.gutenberg.org/6/5/2/6527/6527-t.zip', u'http://www.gutenberg.org/ebooks/1.kindle.images'])
1       language        frozenset([u'en'])
1       rights  frozenset([u'Public domain in the USA.'])
1       subject frozenset([u'E201', u'United States. Declaration of Independence', u'United States -- History -- Revolution, 1775-1783 -- Sources', u'JK'])
1       title   frozenset([u'The Declaration of Independence of the United States of America'])

182     author  frozenset([])
182     formaturi       frozenset([])
182     language        frozenset([u'en'])
182     rights  frozenset([u'None'])
182     subject frozenset([])
182     title   frozenset([])

Expected output:

I'd expect get_metadata("language", 182) and get_metadata("rights", 182) to both return frozenset([]) instead of frozenset([u'en']) and frozenset([u'None']).

Or better, as there's no such ebook, perhaps it should return None or raise an exception: maybe IndexError or something custom like NoEbookIndex. Alternatively, just don't add it to the database in the first place and let that raise whatever it would raise when an index is not found.
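
A minimal sketch of the "raise an exception" option, written as a caller-side wrapper rather than a change to the library. `UnknownEbookError`, `get_metadata_or_raise` and `fake_lookup` are hypothetical names, and treating an empty `formaturi` set as "no such ebook" is an assumption based on the output above.

```python
class UnknownEbookError(LookupError):
    """Hypothetical exception for an etextno with no real ebook behind it."""


def get_metadata_or_raise(lookup, feature_name, etextno):
    """Return lookup(feature_name, etextno), raising for ebooks without formats.

    `lookup` is any callable taking (feature_name, etextno), the same
    argument order as gutenberg.query.get_metadata in the example above.
    """
    # Assumption: an ebook with no download formats does not really exist.
    if not lookup('formaturi', etextno):
        raise UnknownEbookError('no such ebook: {}'.format(etextno))
    return lookup(feature_name, etextno)


# Stand-in data shaped like the output above, so the sketch runs without
# building the metadata cache.
_FAKE_DB = {
    (1, 'formaturi'): frozenset(['http://www.gutenberg.org/files/1/1.txt']),
    (1, 'language'): frozenset(['en']),
    (182, 'language'): frozenset(['en']),
    (182, 'rights'): frozenset(['None']),
}


def fake_lookup(feature_name, etextno):
    """Hypothetical replacement for get_metadata backed by _FAKE_DB."""
    return _FAKE_DB.get((etextno, feature_name), frozenset())
```

With the real library, `get_metadata` could be passed in place of `fake_lookup`, at the cost of one extra lookup per call.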

ikarth · 2016-10-31T11:29:15Z

For reference in fixing this, the RDFs have an entry for 182:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xml:base="http://www.gutenberg.org/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcam="http://purl.org/dc/dcam/"
  xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:cc="http://web.resource.org/cc/"
>
  <pgterms:ebook rdf:about="ebooks/182">
    <dcterms:publisher>Project Gutenberg</dcterms:publisher>
    <dcterms:language>
      <rdf:Description rdf:nodeID="N4260554339da4361bcc361fd183ef804">
        <rdf:value rdf:datatype="http://purl.org/dc/terms/RFC4646">en</rdf:value>
      </rdf:Description>
    </dcterms:language>
    <dcterms:type>
      <rdf:Description rdf:nodeID="Na773bbd642f9428e9e07b94f85257f29">
        <rdf:value>Text</rdf:value>
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/DCMIType"/>
      </rdf:Description>
    </dcterms:type>
    <dcterms:rights>None</dcterms:rights>
    <dcterms:issued>None</dcterms:issued>
    <dcterms:license rdf:resource="license"/>
    <pgterms:downloads rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">0</pgterms:downloads>
  </pgterms:ebook>
  <cc:Work rdf:about="">
    <cc:license rdf:resource="http://www.gnu.org/licenses/gpl.html"/>
    <rdfs:comment>Archives containing the RDF files for *all* our books can be downloaded at
            http://www.gutenberg.org/wiki/Gutenberg:Feeds#The_Complete_Project_Gutenberg_Catalog</rdfs:comment>
  </cc:Work>
</rdf:RDF>

So whatever solution is developed should take phantom RDF entries into consideration.

hugovk · 2016-10-31T12:26:40Z

Ah, interesting. I see the phantom RDF says language = "en" and rights = "None", and the other fields are missing, hence the empty sets for those.

MasterOdin · 2016-11-07T15:47:41Z

There are quite a few "phantom RDFs" in the Gutenberg RDF download for whatever reason. I think it's safe to say that any RDF that is missing both title and author, and whose publisher is "Project Gutenberg", is probably a phantom entry and should be "discarded" in some fashion when setting up the cache. I cannot see why anyone would be interested in these entries when using this library.
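
A rough sketch of that heuristic using only the standard library, applied to a single RDF document. `looks_like_phantom` is a hypothetical helper, the sample documents are heavily trimmed, and using `dcterms:creator` as the author field is an assumption about the RDF layout.

```python
import xml.etree.ElementTree as ET

NS = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcterms': 'http://purl.org/dc/terms/',
    'pgterms': 'http://www.gutenberg.org/2009/pgterms/',
}


def looks_like_phantom(rdf_xml):
    """Heuristic: publisher is Project Gutenberg and title and creator are absent."""
    ebook = ET.fromstring(rdf_xml).find('pgterms:ebook', NS)
    if ebook is None:
        return False
    publisher = ebook.findtext('dcterms:publisher', default='', namespaces=NS)
    has_title = ebook.find('dcterms:title', NS) is not None
    has_author = ebook.find('dcterms:creator', NS) is not None
    return (publisher == 'Project Gutenberg'
            and not has_title and not has_author)


# Trimmed-down stand-in for the phantom entry (not the full RDF file).
PHANTOM = '''<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
  xmlns:dcterms="http://purl.org/dc/terms/">
  <pgterms:ebook rdf:about="ebooks/182">
    <dcterms:publisher>Project Gutenberg</dcterms:publisher>
    <dcterms:rights>None</dcterms:rights>
  </pgterms:ebook>
</rdf:RDF>'''

# Simplified "real" entry: same document plus a made-up title and creator.
REAL = PHANTOM.replace(
    '<dcterms:rights>None</dcterms:rights>',
    '<dcterms:title>Example</dcterms:title>'
    '<dcterms:creator>Somebody</dcterms:creator>')
```

A heuristic like this can misfire on real books that happen to lack a title and an author, so any match is worth double-checking against the site.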

AmaterasuInTheSky · 2016-11-13T09:03:38Z

This is probably quite ignorant of me, but how are you accessing the language and subject parts of a book? I'm told that they cannot be queried (no metadata extractor). I know the functionality exists because I've found the test cases in the code on GitHub. What extra parts did you download or include to make it work? Step-by-step instructions would be appreciated because I have a tendency to mess things up.

ikarth · 2016-11-13T14:04:11Z

The code in the repository is slightly ahead of the PyPI version. A number of the recent improvements (including more metadata extractors) are being held up by the need to decide if we're going to keep support for Python 3.2 or not.

To run the latest version of the code, you can follow the instructions in the readme for installing from source. If you run into issues with that, you'll probably want to create an issue (or post a Stack Overflow question and then create an issue pointing to it, or ask for help in any other way, of course).

ikarth · 2016-11-22T13:56:59Z

On a related note, the RDFs also have dummy entries for 0 and 999999. 0 causes some issues because it raises an InvalidEtextIdException, since an etextno of zero is (reasonably) rejected as invalid. Annoyingly, it does have the copyright status field set, which causes a search for all public domain books to error out.

pg0.rdf:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
   xmlns:cc="http://web.resource.org/cc/"
   xmlns:dcam="http://purl.org/dc/dcam/"
   xmlns:dcterms="http://purl.org/dc/terms/"
   xmlns:marcrel="http://www.loc.gov/loc.terms/relators/"
   xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xml:base="http://www.gutenberg.org/">
  <cc:Work rdf:about="feeds/catalog.rdf">
    <cc:license rdf:resource="http://www.gnu.org/licenses/gpl.html"/>
  </cc:Work>
  <pgterms:ebook rdf:about="ebooks/0">
    <dcterms:issued>None</dcterms:issued>
    <dcterms:language rdf:datatype="http://purl.org/dc/terms/RFC4646">en</dcterms:language>
    <dcterms:license rdf:resource="license"/>
    <dcterms:publisher>Project Gutenberg</dcterms:publisher>
    <dcterms:rights>Public domain in the USA.</dcterms:rights>
    <dcterms:type>
      <rdf:Description>
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/DCMIType"/>
        <rdf:value>Text</rdf:value>
      </rdf:Description>
    </dcterms:type>
  </pgterms:ebook>
</rdf:RDF>
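
One way a search could tolerate the dummy entry for 0 is to skip ids that fail validation instead of letting the exception propagate. This is only a sketch and not necessarily how the library handles it: `InvalidEtextId`, `validate_etextno` and `valid_etextnos` are local stand-ins defined here for illustration.

```python
class InvalidEtextId(ValueError):
    """Local stand-in for the library's InvalidEtextIdException."""


def validate_etextno(etextno):
    """Reject anything that is not a positive integer (assumed rule)."""
    if not isinstance(etextno, int) or etextno <= 0:
        raise InvalidEtextId(etextno)
    return etextno


def valid_etextnos(candidates):
    """Yield only ids that pass validation, skipping dummy entries like 0."""
    for candidate in candidates:
        try:
            yield validate_etextno(candidate)
        except InvalidEtextId:
            # Skip the entry rather than aborting the whole search.
            continue


kept = list(valid_etextnos([0, 1, 182, 999999]))
```

Note that 182 and 999999 still pass, so this only avoids the crash; it does not filter out the phantom entries themselves.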

c-w · 2017-02-19T01:20:12Z

@ikarth: Now that #75 is merged, the query crash due to the phantom books should be fixed. Let me know if you still face this problem.

c-w · 2017-02-19T02:45:38Z

According to @MasterOdin earlier in the thread, we can identify phantoms by the fact that they are published by Project Gutenberg and have neither a title nor an author. The following query implements this search:

from gutenberg.acquire.metadata import load_metadata

# warning: this is slow, takes minutes to run on my machine
phantom_query = load_metadata().query('''
SELECT ?uri
WHERE {
  ?uri
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  <http://www.gutenberg.org/2009/pgterms/ebook>.

  ?uri
  <http://purl.org/dc/terms/publisher>
  "Project Gutenberg".

  FILTER(NOT EXISTS { ?uri <http://purl.org/dc/terms/creator> ?creator })
  FILTER(NOT EXISTS { ?uri <http://purl.org/dc/terms/title> ?title }) }
''')

phantom_urls = frozenset(result['uri'].toPython() for result in phantom_query)

with open('phantom_urls.txt', 'w') as fobj:
    fobj.write('\n'.join(sorted(phantom_urls)))

Let's verify that none of these ebooks exist using a quick bash script:

# `|| [ -n ... ]` also processes a final line that has no trailing newline
while read -r phantom_url || [ -n "${phantom_url}" ]; do
  curl --silent "${phantom_url}" | grep -q '<title>404 Not Found</title>' || echo "${phantom_url} did not 404";
done < 'phantom_urls.txt'

It turns out that the query has some false positives: valid books without authors and titles!

http://www.gutenberg.org/ebooks/50624 did not 404
http://www.gutenberg.org/ebooks/50625 did not 404
http://www.gutenberg.org/ebooks/51307 did not 404
http://www.gutenberg.org/ebooks/51950 did not 404

c-w · 2017-02-19T02:50:25Z

Did a bit more digging. A solid way to identify phantoms seems to be to check if the book has no published formats.

SELECT ?uri
WHERE {
  ?uri
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  <http://www.gutenberg.org/2009/pgterms/ebook>.

  FILTER(NOT EXISTS { ?uri <http://purl.org/dc/terms/hasFormat> ?format })
}

This finds all 71 real phantoms identified by the query in my earlier comment minus the 4 false positives.
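
For readers less familiar with SPARQL, the same selection can be sketched in plain Python over (subject, predicate, object) tuples of strings. `find_phantoms` is a hypothetical helper and the sample triples are made up; it is meant to show what the `NOT EXISTS` filter selects, not to replace the query.

```python
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'
EBOOK = 'http://www.gutenberg.org/2009/pgterms/ebook'
HAS_FORMAT = 'http://purl.org/dc/terms/hasFormat'


def find_phantoms(triples):
    """Return subjects typed as ebook that have no hasFormat triple."""
    ebooks = set()
    with_format = set()
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE and obj == EBOOK:
            ebooks.add(subject)
        elif predicate == HAS_FORMAT:
            with_format.add(subject)
    # Equivalent of FILTER(NOT EXISTS { ?uri hasFormat ?format }).
    return ebooks - with_format


# Made-up sample data: ebook 1 has a format, ebook 182 has none.
SAMPLE = [
    ('http://www.gutenberg.org/ebooks/1', RDF_TYPE, EBOOK),
    ('http://www.gutenberg.org/ebooks/1', HAS_FORMAT,
     'http://www.gutenberg.org/files/1/1.txt'),
    ('http://www.gutenberg.org/ebooks/182', RDF_TYPE, EBOOK),
    ('http://www.gutenberg.org/ebooks/182',
     'http://purl.org/dc/terms/rights', 'None'),
]

phantoms = find_phantoms(SAMPLE)
```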

c-w · 2017-02-19T03:10:44Z

Here is a query that we can use to delete all the phantoms:

DELETE { ?s ?p ?o }
WHERE {
 SELECT ?s ?p ?o
 WHERE {
  ?s a <http://www.gutenberg.org/2009/pgterms/ebook>.

  FILTER(NOT EXISTS {
   ?s <http://purl.org/dc/terms/hasFormat> ?format.
  })
 }
}

The method to use to execute this query is Graph.update.

c-w · 2017-02-19T06:21:40Z

I've done a bit of work towards resolving this issue on the filter-phantom-books branch. We now have a unit test that reproduces the problem and I've implemented the approach discussed above: identify the phantoms at metadata cache creation time and remove them from the cache.

However, for some reason, deleting items from the graph doesn't seem to work (code). Does anyone have an idea what could be going wrong here? @hugovk @MasterOdin @ikarth
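
One possible explanation, offered as a guess and not confirmed anywhere in this thread: in the DELETE query posted earlier, the WHERE clause only binds ?s, while ?p and ?o never appear in a triple pattern. Under SPARQL 1.1 Update, a template triple that contains an unbound variable is skipped, so such an update would remove nothing. The toy model below is plain Python, not rdflib; `instantiate` and the sample data are made up to illustrate the rule.

```python
def instantiate(template, solutions):
    """Fill a (s, p, o) template of variable names from each solution.

    Models the SPARQL Update rule that a template triple containing an
    unbound variable is dropped rather than treated as a wildcard.
    """
    triples = set()
    for solution in solutions:
        triple = tuple(solution.get(var) for var in template)
        if None not in triple:
            triples.add(triple)
    return triples


graph = {
    ('ebooks/182', 'rdf:type', 'pgterms:ebook'),
    ('ebooks/182', 'dcterms:rights', 'None'),
}
template = ('?s', '?p', '?o')

# WHERE binds only ?s: no template triple can be built, nothing is deleted.
only_subject_bound = [{'?s': 'ebooks/182'}]
deleted_when_only_s = instantiate(template, only_subject_bound)

# A `?s ?p ?o` pattern in WHERE would bind all three variables.
all_bound = [{'?s': s, '?p': p, '?o': o} for (s, p, o) in graph]
remaining_when_all_bound = graph - instantiate(template, all_bound)
```

If this is indeed the cause, adding a `?s ?p ?o.` pattern to the inner WHERE clause might be enough, but that is untested here.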

c-w added a commit that referenced this issue Feb 19, 2017

Prevent phantom books from crashing queries

0ce8098

This is a partial fix for #57

c-w added a commit that referenced this issue Feb 19, 2017

Prevent phantom books from crashing queries

a8e0c6a

This is a partial fix for #57

c-w mentioned this issue Feb 19, 2017

Prevent phantom metadata from crashing queries #75

Merged

c-w added a commit that referenced this issue Feb 19, 2017

Remove phantom entries when building the metadata

5a6e7f7

Fixes #57

c-w added a commit that referenced this issue Feb 19, 2017

Remove phantom entries when building the metadata

b613773

Fixes #57

c-w added help wanted enhancement labels Jul 19, 2020
