Slow loading of the Wikidata .bz2 dump #105
Complementary info:
The good news is that the growth in Wikidata volume does not affect runtime, only the storage size.
Hi Patrice,
When filling the statement db, if I detect a concept matching a constraint ("instance of" "scholarly article", for example), then I discard this concept and do not store its statements.
I think we can considerably reduce the size of the statement db this way. I can even propose a PR for such a mechanism.
Best regards
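A minimal sketch of the filtering idea above, assuming the dump is the standard Wikidata JSON export (one entity per line inside a large JSON array). The QID for "scholarly article" (Q13442814), the `should_skip` helper, and the dict-based `store` are illustrative choices, not the project's actual API:

```python
import json

# Hypothetical exclusion set: entities that are an "instance of" (P31)
# one of these classes will not have their statements stored.
EXCLUDED_CLASSES = {"Q13442814"}  # scholarly article

def instance_of_ids(entity):
    """Yield the QIDs the entity is an instance of (property P31)."""
    for claim in entity.get("claims", {}).get("P31", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        qid = value.get("id")
        if qid:
            yield qid

def should_skip(entity, excluded=EXCLUDED_CLASSES):
    """True if the entity matches an excluded class constraint."""
    return any(qid in excluded for qid in instance_of_ids(entity))

def fill_statement_db(dump_lines, store):
    """Store statements only for entities that pass the filter."""
    for line in dump_lines:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # the JSON dump wraps entities in one big array
        entity = json.loads(line)
        if should_skip(entity):
            continue  # discard the concept: its statements are never stored
        store[entity["id"]] = entity.get("claims", {})
```

Filtering at load time this way means the excluded statements never reach lmdb at all, which is where the size reduction comes from.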
The Wikidata dump has become very big, with 1.2 billion statements, which makes the initial loading of the .bz2 dump into lmdb particularly slow.
To speed up this step, we could try:
- instead of making two passes over the dump (one to get the properties, one to get the statements), do both in a single pass and resolve the properties afterwards against the db
- instead of reading line by line, read larger buffer blocks
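Both suggestions above can be sketched together. This is an illustrative outline, not the project's code: `one_pass_load` collects property labels (P-items) and entity statements (Q-items) in a single sweep, so no second pass is needed, and `read_blocks` decompresses with a large read buffer instead of per-line reads:

```python
import bz2
import json

def one_pass_load(lines):
    """Single pass over the dump: record property labels and entity
    statements together; property resolution happens after the pass
    using the collected labels instead of a second read of the dump."""
    property_labels = {}   # e.g. "P31" -> "instance of"
    statements = {}        # e.g. "Q42" -> raw claims dict
    for raw in lines:
        raw = raw.strip().rstrip(",")
        if not raw or raw in ("[", "]"):
            continue
        entity = json.loads(raw)
        eid = entity["id"]
        if eid.startswith("P"):
            label = entity.get("labels", {}).get("en", {}).get("value", eid)
            property_labels[eid] = label
        else:
            statements[eid] = entity.get("claims", {})
    return property_labels, statements

def read_blocks(path, block_size=4 << 20):
    """Read the .bz2 dump in large blocks (4 MB here, a tunable
    assumption) and yield complete lines, carrying partial lines over
    between blocks."""
    tail = ""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            block = tail + block
            pieces = block.split("\n")
            tail = pieces.pop()  # last piece may be a partial line
            yield from pieces
    if tail:
        yield tail
```

Usage would be `one_pass_load(read_blocks("wikidata.json.bz2"))`; the block size is the knob to tune against I/O and decompression throughput.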