This repository has been archived by the owner on Jan 12, 2023. It is now read-only.

Unexpected metadata for non-existent ebooks #57

Open
hugovk opened this issue Oct 31, 2016 · 11 comments

@hugovk
Collaborator

hugovk commented Oct 31, 2016

The metadata database contains records for non-existent ebooks.

Exists: https://www.gutenberg.org/ebooks/1
Doesn't exist: https://www.gutenberg.org/ebooks/182

Example code:

from __future__ import print_function
from gutenberg.query import get_metadata

def get_all_metadata(etextno):
    for feature_name in ['author', 'formaturi', 'language',
                         'rights', 'subject', 'title']:
        print("{}\t{}\t{}".format(
            etextno,
            feature_name,
            get_metadata(feature_name, etextno)))
    print()

get_all_metadata(1)  # US Declaration of Independence
get_all_metadata(182)  # no such ebook

Actual output:

1       author  frozenset([u'United States President (1801-1809)'])
1       formaturi       frozenset([u'http://www.gutenberg.org/ebooks/1.txt.utf-8', u'http://www.gutenberg.org/ebooks/1.epub.noimages', u'http://www.gutenberg.org/6/5/2/6527/6527-t/6527-t.tex', u'http://www.gutenberg.org/ebooks/1.html.noimages', u'http://www.gutenberg.org/files/1/1.zip', u'http://www.gutenberg.org/ebooks/1.epub.images', u'http://www.gutenberg.org/ebooks/1.rdf', u'http://www.gutenberg.org/ebooks/1.kindle.noimages', u'http://www.gutenberg.org/files/1/1.txt', u'http://www.gutenberg.org/ebooks/1.html.images', u'http://www.gutenberg.org/6/5/2/6527/6527-t.zip', u'http://www.gutenberg.org/ebooks/1.kindle.images'])
1       language        frozenset([u'en'])
1       rights  frozenset([u'Public domain in the USA.'])
1       subject frozenset([u'E201', u'United States. Declaration of Independence', u'United States -- History -- Revolution, 1775-1783 -- Sources', u'JK'])
1       title   frozenset([u'The Declaration of Independence of the United States of America'])

182     author  frozenset([])
182     formaturi       frozenset([])
182     language        frozenset([u'en'])
182     rights  frozenset([u'None'])
182     subject frozenset([])
182     title   frozenset([])

Expected output:

I'd expect get_metadata("language", 182) and get_metadata("rights", 182) to both return frozenset([]) instead of frozenset([u'en']) and frozenset([u'None']).

Or better, as there's no such ebook, perhaps it should return None or raise an exception: maybe IndexError or something custom like NoEbookIndex. Alternatively, just don't add it to the database in the first place and let the lookup raise whatever it would raise when an index is not found.
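
To make the "raise an exception" option concrete, here is a minimal sketch written as a wrapper rather than as a change to the library. The wrapper name `get_metadata_strict` is hypothetical, `NoEbookIndex` borrows the name floated above, and the rule "no format URIs means no ebook" is an assumption based on the output for 182. The `lookup` argument stands in for `gutenberg.query.get_metadata`, so the sketch runs without the metadata cache.

```python
class NoEbookIndex(LookupError):
    """Hypothetical exception for etext numbers that have no real ebook."""


def get_metadata_strict(feature_name, etextno, lookup):
    """Return metadata via ``lookup``, raising if the ebook looks non-existent.

    ``lookup`` is any callable taking (feature_name, etextno), the same
    argument order as ``gutenberg.query.get_metadata``.
    """
    # Assumption: an ebook without any format URIs does not really exist.
    if not lookup('formaturi', etextno):
        raise NoEbookIndex('no ebook with etextno {}'.format(etextno))
    return lookup(feature_name, etextno)


# Stand-in data shaped like the output above, so the sketch runs on its own.
_FAKE_CACHE = {
    (1, 'formaturi'): frozenset(['http://www.gutenberg.org/files/1/1.txt']),
    (1, 'language'): frozenset(['en']),
    (182, 'language'): frozenset(['en']),
    (182, 'rights'): frozenset(['None']),
}


def fake_lookup(feature_name, etextno):
    """Mimic get_metadata: unknown features come back as an empty frozenset."""
    return _FAKE_CACHE.get((etextno, feature_name), frozenset())
```

With the real library, passing `get_metadata` as `lookup` should behave the same way, at the cost of one extra `formaturi` query per call.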

@ikarth
Contributor

ikarth commented Oct 31, 2016

For reference in fixing this, the RDFs have an entry for 182:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xml:base="http://www.gutenberg.org/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcam="http://purl.org/dc/dcam/"
  xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:cc="http://web.resource.org/cc/"
>
  <pgterms:ebook rdf:about="ebooks/182">
    <dcterms:publisher>Project Gutenberg</dcterms:publisher>
    <dcterms:language>
      <rdf:Description rdf:nodeID="N4260554339da4361bcc361fd183ef804">
        <rdf:value rdf:datatype="http://purl.org/dc/terms/RFC4646">en</rdf:value>
      </rdf:Description>
    </dcterms:language>
    <dcterms:type>
      <rdf:Description rdf:nodeID="Na773bbd642f9428e9e07b94f85257f29">
        <rdf:value>Text</rdf:value>
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/DCMIType"/>
      </rdf:Description>
    </dcterms:type>
    <dcterms:rights>None</dcterms:rights>
    <dcterms:issued>None</dcterms:issued>
    <dcterms:license rdf:resource="license"/>
    <pgterms:downloads rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">0</pgterms:downloads>
  </pgterms:ebook>
  <cc:Work rdf:about="">
    <cc:license rdf:resource="http://www.gnu.org/licenses/gpl.html"/>
    <rdfs:comment>Archives containing the RDF files for *all* our books can be downloaded at
            http://www.gutenberg.org/wiki/Gutenberg:Feeds#The_Complete_Project_Gutenberg_Catalog</rdfs:comment>
  </cc:Work>
</rdf:RDF>

So whatever solution is developed should take phantom RDF entries into consideration.

@hugovk
Collaborator Author

hugovk commented Oct 31, 2016

Ah, interesting. I see the phantom RDF says language = "en" and rights = "None", and the other fields are missing, hence the empty results for those.

@MasterOdin
Collaborator

MasterOdin commented Nov 7, 2016

There are quite a few "phantom RDFs" in the Gutenberg RDF download, for whatever reason. I think it would be safe to say that any RDF that is missing both title and author and whose publisher is "Project Gutenberg" is probably a phantom entry. Such entries should be "discarded" in some fashion when setting up the cache, as I cannot see why anyone using this library would be interested in them.
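
The heuristic described here can be sketched against a single RDF file using only the standard library. The function name `looks_like_phantom` and the two sample documents are made up for illustration (the "real" record is heavily simplified), and the sketch assumes one `pgterms:ebook` element per file, as in the RDF for 182 quoted earlier.

```python
import xml.etree.ElementTree as ET

NS = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcterms': 'http://purl.org/dc/terms/',
    'pgterms': 'http://www.gutenberg.org/2009/pgterms/',
}


def looks_like_phantom(rdf_xml):
    """Heuristic: no title, no creator, and publisher is Project Gutenberg."""
    root = ET.fromstring(rdf_xml)
    ebook = root.find('pgterms:ebook', NS)
    if ebook is None:
        return False  # not an ebook record at all
    has_title = ebook.find('dcterms:title', NS) is not None
    has_author = ebook.find('dcterms:creator', NS) is not None
    publisher = ebook.findtext('dcterms:publisher', default='', namespaces=NS)
    return (not has_title and not has_author
            and publisher.strip() == 'Project Gutenberg')


# Illustrative samples only; real Project Gutenberg RDF files are much richer.
_TEMPLATE = '''<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/{etextno}">
    <dcterms:publisher>Project Gutenberg</dcterms:publisher>
    {extra}
  </pgterms:ebook>
</rdf:RDF>'''

PHANTOM_RDF = _TEMPLATE.format(
    etextno=182,
    extra='<dcterms:rights>None</dcterms:rights>')
REAL_RDF = _TEMPLATE.format(
    etextno=1,
    extra='<dcterms:title>The Declaration of Independence of the United '
          'States of America</dcterms:title>'
          '<dcterms:creator>Jefferson, Thomas</dcterms:creator>')
```

Here `looks_like_phantom(PHANTOM_RDF)` is true and `looks_like_phantom(REAL_RDF)` is false. The check is only as good as the heuristic itself: a genuine book catalogued without title and author would also be flagged.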

@AmaterasuInTheSky

AmaterasuInTheSky commented Nov 13, 2016

This is probably quite ignorant of me, but how are you accessing the language and subject parts of a book? I'm told that they cannot be queried (no metadata extractor). I know the functionality exists because I've found the test cases in your GitHub code. What extra parts did you download or include to make it work? Step-by-step instructions would be appreciated because I have a tendency to mess things up.

@ikarth
Contributor

ikarth commented Nov 13, 2016

The code in the repository is slightly ahead of the PyPI version. A number of the recent improvements (including more metadata extractors) are being held up by the need to decide if we're going to keep support for Python 3.2 or not.

To run the latest version of the code, you can follow the instructions in the readme for installing from source. If you run into issues with that, you'll probably want to create an issue (or post a StackOverflow question and then create an issue pointing to it, or any other way to ask for help, of course.)

@ikarth
Contributor

ikarth commented Nov 22, 2016

On a related note, the RDFs also have dummy entries for 0 and 999999. Entry 0 is a problem because it raises an InvalidEtextIdException, since an etextno of zero is (reasonably) rejected as invalid. Annoyingly, it does have the copyright status field set, which causes a search for all public domain books to error out.

pg0.rdf:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
   xmlns:cc="http://web.resource.org/cc/"
   xmlns:dcam="http://purl.org/dc/dcam/"
   xmlns:dcterms="http://purl.org/dc/terms/"
   xmlns:marcrel="http://www.loc.gov/loc.terms/relators/"
   xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xml:base="http://www.gutenberg.org/">
  <cc:Work rdf:about="feeds/catalog.rdf">
    <cc:license rdf:resource="http://www.gnu.org/licenses/gpl.html"/>
  </cc:Work>
  <pgterms:ebook rdf:about="ebooks/0">
    <dcterms:issued>None</dcterms:issued>
    <dcterms:language rdf:datatype="http://purl.org/dc/terms/RFC4646">en</dcterms:language>
    <dcterms:license rdf:resource="license"/>
    <dcterms:publisher>Project Gutenberg</dcterms:publisher>
    <dcterms:rights>Public domain in the USA.</dcterms:rights>
    <dcterms:type>
      <rdf:Description>
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/DCMIType"/>
        <rdf:value>Text</rdf:value>
      </rdf:Description>
    </dcterms:type>
  </pgterms:ebook>
</rdf:RDF>
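
One way to keep dummy records like this from breaking a bulk search is to drop them when turning result URIs into etext numbers. The sketch below is an illustration only: `etextnos_from_uris` and `DUMMY_ETEXTNOS` are hypothetical names, not part of the library, and it assumes 0 and 999999 are the only dummy IDs.

```python
import re

# Assumption: these are the only dummy IDs, as mentioned in the comment above.
DUMMY_ETEXTNOS = frozenset([0, 999999])

# Matches URIs that end in "ebooks/<number>", e.g. ".../ebooks/182".
_EBOOK_URI = re.compile(r'ebooks/(\d+)$')


def etextnos_from_uris(uris):
    """Return the sorted etext numbers found in ``uris``, skipping dummies."""
    etextnos = set()
    for uri in uris:
        match = _EBOOK_URI.search(uri)
        if match is None:
            continue  # not an ebook URI (for example a format URI)
        etextno = int(match.group(1))
        if etextno in DUMMY_ETEXTNOS:
            continue  # placeholder record, not a real ebook
        etextnos.add(etextno)
    return sorted(etextnos)
```

Skipping the known dummies up front means one bad record no longer aborts the whole search; whether the library should do this silently or log it is a separate design question.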

c-w added a commit that referenced this issue Feb 19, 2017
@c-w
Owner

c-w commented Feb 19, 2017

@ikarth: Now that #75 is merged, the query crash due to the phantom books should be fixed. Let me know if you still face this problem.

@c-w
Owner

c-w commented Feb 19, 2017

According to @MasterOdin earlier in the thread, we can identify phantoms by the fact that they are published by Project Gutenberg and have neither a title nor an author. The following query implements this search:

from gutenberg.acquire.metadata import load_metadata

# warning: this is slow, takes minutes to run on my machine
phantom_query = load_metadata().query('''
SELECT ?uri
WHERE {
  ?uri
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  <http://www.gutenberg.org/2009/pgterms/ebook>.

  ?uri
  <http://purl.org/dc/terms/publisher>
  "Project Gutenberg".

  FILTER(NOT EXISTS { ?uri <http://purl.org/dc/terms/creator> ?creator })
  FILTER(NOT EXISTS { ?uri <http://purl.org/dc/terms/title> ?title }) }
''')

phantom_urls = frozenset(result['uri'].toPython() for result in phantom_query)

with open('phantom_urls.txt', 'w') as fobj:
    fobj.write('\n'.join(sorted(phantom_urls)))

Let's verify that all of these ebooks don't exist using a quick bash script:

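# read -r keeps backslashes literal; the || test also processes a last line that lacks a trailing newline
while IFS= read -r phantom_url || [ -n "${phantom_url}" ]; do
  curl --silent "${phantom_url}" | grep -q '<title>404 Not Found</title>' || echo "${phantom_url} did not 404"
done < 'phantom_urls.txt'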

Turns out that we have some false positives in the query: valid books without authors and titles!

@c-w
Owner

c-w commented Feb 19, 2017

Did a bit more digging. A solid way to identify phantoms seems to be to check if the book has no published formats.

SELECT ?uri
WHERE {
  ?uri
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  <http://www.gutenberg.org/2009/pgterms/ebook>.

  FILTER(NOT EXISTS { ?uri <http://purl.org/dc/terms/hasFormat> ?format })
}

This finds all 71 real phantoms identified by the query in my earlier comment minus the 4 false positives.

@c-w
Owner

c-w commented Feb 19, 2017

Here is a query that we can use to delete all the phantoms:

DELETE { ?s ?p ?o }
WHERE {
 SELECT ?s ?p ?o
 WHERE {
  ?s a <http://www.gutenberg.org/2009/pgterms/ebook>.

  FILTER(NOT EXISTS {
   ?s <http://purl.org/dc/terms/hasFormat> ?format.
  })
 }
}

The method to use to execute this query is Graph.update.

c-w added a commit that referenced this issue Feb 19, 2017
@c-w
Owner

c-w commented Feb 19, 2017

I've done a bit of work towards resolving this issue on the filter-phantom-books branch. We now have a unit test that reproduces the problem and I've implemented the approach discussed above: identify the phantoms at metadata cache creation time and remove them from the cache.

However, for some reason, deleting items from the graph doesn't seem to work (code). Does anyone have an idea what could be going wrong here? @hugovk @MasterOdin @ikarth
