Unexpected metadata for non-existent ebooks #57
Comments
For reference in fixing this, the RDFs have an entry for 182:
So whatever solution is developed should take phantom RDF entries into consideration. |
Ah, interesting. I see the phantom RDF says language = "en" and rights = "None", and the other fields are missing hence |
There are quite a few "phantom RDFs" in the Gutenberg RDF download for whatever reason. I think it would be safe to say that any RDF that is missing a title and an author, and whose publisher is "Project Gutenberg", is probably a phantom entry and should be "discarded" in some fashion when setting up the cache, as I cannot see why anyone would be interested in these entries when using this library. |
This is probably quite ignorant of me, but how are you accessing the language and subject parts of a book? I'm told that they cannot be queried (no metadata extractor). I know the functionality exists because I've found the testcases in your github's code. What extra parts did you download or include to make it work.....? Step by step instructions would be appreciated because I have a tendency to mess things up. |
The code in the repository is slightly ahead of the PyPI version. A number of the recent improvements (including more metadata extractors) are being held up by the need to decide if we're going to keep support for Python 3.2 or not. To run the latest version of the code, you can follow the instructions in the readme for installing from source. If you run into issues with that, you'll probably want to create an issue (or post a StackOverflow question and then create an issue pointing to it, or any other way to ask for help, of course.) |
On a related note, the RDFs also have dummy entries for 0 and 999999. The entry for 0 is a problem because it raises an InvalidEtextIdException, since an etextno of zero is (reasonably) rejected as invalid. Annoyingly, it does have the copyright status field set, which causes a search for all public domain books to error out. pg0.rdf:
|
According to @MasterOdin earlier in the thread, we can identify phantoms by the fact that they are published by Project Gutenberg and have neither a title nor an author. The following query implements this search:

    from gutenberg.acquire.metadata import load_metadata

    # warning: this is slow, takes minutes to run on my machine
    phantom_query = load_metadata().query('''
    SELECT ?uri
    WHERE {
      ?uri
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
        <http://www.gutenberg.org/2009/pgterms/ebook>.
      ?uri
        <http://purl.org/dc/terms/publisher>
        "Project Gutenberg".
      FILTER(NOT EXISTS { ?uri <http://purl.org/dc/terms/creator> ?creator })
      FILTER(NOT EXISTS { ?uri <http://purl.org/dc/terms/title> ?title })
    }
    ''')
    phantom_urls = frozenset(result['uri'].toPython() for result in phantom_query)
    with open('phantom_urls.txt', 'w') as fobj:
        fobj.write('\n'.join(sorted(phantom_urls)))

Let's verify that all of these ebooks don't exist using a quick bash script:

    while read phantom_url; do
      curl --silent "${phantom_url}" | grep -q '<title>404 Not Found</title>' || echo "${phantom_url} did not 404";
    done < 'phantom_urls.txt'

Turns out that we have some false positives in the query: valid books without authors and titles!
|
Did a bit more digging. A solid way to identify phantoms seems to be to check if the book has no published formats.

    SELECT ?uri
    WHERE {
      ?uri
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
        <http://www.gutenberg.org/2009/pgterms/ebook>.
      FILTER(NOT EXISTS { ?uri <http://purl.org/dc/terms/hasFormat> ?format })
    }

This finds all 71 real phantoms identified by the query in my earlier comment minus the 4 false positives. |
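To make the difference between the two heuristics concrete, here is a minimal plain-Python sketch over a toy set of triples. It does not use rdflib or the real cache: the subjects `book:complete`, `book:untitled` and `book:phantom` and all helper names are made up for illustration, and only the predicate URIs come from the queries in this thread.

```python
# Toy stand-in for the metadata graph: a set of (subject, predicate, object) triples.
# All subjects and literal values here are invented for illustration.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EBOOK = "http://www.gutenberg.org/2009/pgterms/ebook"
PUBLISHER = "http://purl.org/dc/terms/publisher"
CREATOR = "http://purl.org/dc/terms/creator"
TITLE = "http://purl.org/dc/terms/title"
HAS_FORMAT = "http://purl.org/dc/terms/hasFormat"

TRIPLES = {
    # A normal book: title, author and at least one format.
    ("book:complete", RDF_TYPE, EBOOK),
    ("book:complete", PUBLISHER, "Project Gutenberg"),
    ("book:complete", CREATOR, "agent:someone"),
    ("book:complete", TITLE, "Some Title"),
    ("book:complete", HAS_FORMAT, "file:complete.txt"),
    # A valid book without title or author, but with a published format.
    ("book:untitled", RDF_TYPE, EBOOK),
    ("book:untitled", PUBLISHER, "Project Gutenberg"),
    ("book:untitled", HAS_FORMAT, "file:untitled.txt"),
    # A phantom: no title, no author, no formats.
    ("book:phantom", RDF_TYPE, EBOOK),
    ("book:phantom", PUBLISHER, "Project Gutenberg"),
}


def ebooks(triples):
    """Return every subject that is typed as an ebook."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == EBOOK}


def has_predicate(triples, subject, predicate):
    """Return True if the subject has at least one triple with this predicate."""
    return any(s == subject and p == predicate for s, p, _ in triples)


def phantoms_by_missing_fields(triples):
    """First heuristic: published by Project Gutenberg, no creator, no title."""
    return {
        s
        for s in ebooks(triples)
        if (s, PUBLISHER, "Project Gutenberg") in triples
        and not has_predicate(triples, s, CREATOR)
        and not has_predicate(triples, s, TITLE)
    }


def phantoms_by_missing_formats(triples):
    """Second heuristic: the ebook has no hasFormat triple at all."""
    return {s for s in ebooks(triples) if not has_predicate(triples, s, HAS_FORMAT)}
```

On this toy data the first heuristic also flags `book:untitled`, which mirrors the false positives reported above, while the format-based check flags only `book:phantom`.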
Here is a query that we can use to delete all the phantoms:

    DELETE { ?s ?p ?o }
    WHERE {
      SELECT ?s ?p ?o
      WHERE {
        ?s a <http://www.gutenberg.org/2009/pgterms/ebook>.
        FILTER(NOT EXISTS {
          ?s <http://purl.org/dc/terms/hasFormat> ?format.
        })
      }
    }

The method to use to execute this query is Graph.update. |
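As a rough illustration of the intended effect of such a delete, the sketch below removes every triple whose subject is an ebook without a `hasFormat` entry. It uses plain Python sets instead of rdflib, and the function name `remove_phantoms` and the sample subjects are hypothetical. One thing that may be worth checking in the SPARQL version (an assumption, not verified against the library): a DELETE template only removes triples whose variables are all bound by the WHERE clause, so `?p` and `?o` would need to be matched there, for example with a `?s ?p ?o` pattern.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EBOOK = "http://www.gutenberg.org/2009/pgterms/ebook"
HAS_FORMAT = "http://purl.org/dc/terms/hasFormat"
LANGUAGE = "http://purl.org/dc/terms/language"


def remove_phantoms(triples):
    """Return a new set without any triple whose subject is a phantom ebook.

    A phantom is taken to be an ebook subject with no hasFormat triple.
    Every (s, p, o) triple for such a subject is dropped, which corresponds
    to binding ?s, ?p and ?o together before deleting.
    """
    ebook_subjects = {s for s, p, o in triples if p == RDF_TYPE and o == EBOOK}
    subjects_with_format = {s for s, p, _ in triples if p == HAS_FORMAT}
    phantoms = ebook_subjects - subjects_with_format
    return {(s, p, o) for s, p, o in triples if s not in phantoms}


# Invented sample data: one real book and one phantom.
SAMPLE = {
    ("book:real", RDF_TYPE, EBOOK),
    ("book:real", LANGUAGE, "en"),
    ("book:real", HAS_FORMAT, "file:real.txt"),
    ("book:phantom", RDF_TYPE, EBOOK),
    ("book:phantom", LANGUAGE, "en"),
}

CLEANED = remove_phantoms(SAMPLE)
```

After the call, `CLEANED` still holds the three `book:real` triples, and nothing about `book:phantom` remains, including its language entry.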
I've done a bit of work towards resolving this issue on the filter-phantom-books branch. We now have a unit test that reproduces the problem and I've implemented the approach discussed above: identify the phantoms at metadata cache creation time and remove them from the cache. However, for some reason, deleting items from the graph doesn't seem to work (code). Does anyone have an idea what could be going wrong here? @hugovk @MasterOdin @ikarth |
The metadata database contains records for non-existent ebooks.
Exists: https://www.gutenberg.org/ebooks/1
Doesn't exist: https://www.gutenberg.org/ebooks/182
Example code:
Actual output:
Expected output:
I'd expect get_metadata("language", 182) and get_metadata("rights", 182) to both return frozenset([]) instead of frozenset([u'en']) and frozenset([u'None']).
Or better, as there's no such ebook, perhaps it should return None or raise an exception: maybe IndexError or something custom like NoEbookIndex. Alternatively, just don't add it to the database in the first place and let the lookup raise whatever it would raise when an index is not found.
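The "raise an exception" option could look roughly like the sketch below. This is not the library's API: `get_metadata_strict` and the dict-backed `FAKE_CACHE` are hypothetical stand-ins, and `NoEbookIndex` is just the custom exception name floated above, here assumed to subclass `LookupError`.

```python
class NoEbookIndex(LookupError):
    """Hypothetical exception for an etext number with no real ebook behind it."""


# Stand-in for the metadata cache: etext number -> feature name -> values.
# Ebook 182 is deliberately absent, as if phantoms had been filtered out.
FAKE_CACHE = {
    1: {"language": frozenset(["en"])},
}


def get_metadata_strict(feature_name, etextno, cache=FAKE_CACHE):
    """Return the metadata values for an ebook, or raise if the ebook is unknown.

    A known ebook with no value for the feature still yields an empty frozenset.
    """
    if etextno not in cache:
        raise NoEbookIndex("no ebook with etext number {}".format(etextno))
    return cache[etextno].get(feature_name, frozenset())
```

With this shape, callers can tell "the book exists but has no value for this field" (an empty frozenset) apart from "there is no such book" (an exception).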