
Findings & Statistics


Findings

Pages From the Dump

We extract the Wikipedia pages from the dump trwiki-20190401-pages-articles-multistream.xml.bz2, which contains 869719 pages in total.

We use a module called mwxml to parse the XML file and extract the pages. The XML dump does not contain information about which pages reference a given page, so we need another way to obtain the list of referencing pages. There are several approaches I can think of:

  • Traverse all the pages every time we need the links to a page. (Time inefficient, although in practice it does not take that long.)
  • Traverse all the pages once and save the links in a map. (Memory inefficient; see the sketch after this list.)
  • Fetch the referencing pages from Wikipedia with an HTTP request. (Network lag.)
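Below is a minimal sketch of how the dump parsing and the second option (a reverse-link map) could look. It assumes that mwxml's Dump.from_file accepts the decompressed file object returned by bz2.open; the DUMP_PATH constant, the LINK_RE regex (which only handles plain [[Target]] and [[Target|label]] links), and the build_backlink_map helper are illustrative names, not the project's actual code.

    import bz2
    import re
    from collections import defaultdict

    import mwxml  # pip install mwxml

    DUMP_PATH = "trwiki-20190401-pages-articles-multistream.xml.bz2"

    # Wikitext internal links look like [[Target]] or [[Target|label]].
    LINK_RE = re.compile(r"\[\[([^\]\|#]+)")

    def build_backlink_map(dump_path):
        """Option 2 above: one pass over the dump, recording for every
        linked title the set of pages that link to it (memory inefficient)."""
        backlinks = defaultdict(set)
        dump = mwxml.Dump.from_file(bz2.open(dump_path, mode="rt", encoding="utf-8"))
        for page in dump:
            for revision in page:  # pages-articles dumps carry one revision per page
                text = revision.text or ""
                for target in LINK_RE.findall(text):
                    backlinks[target.strip()].add(page.title)
        return backlinks

    if __name__ == "__main__":
        backlinks = build_backlink_map(DUMP_PATH)
        print(len(backlinks), "distinct link targets")

The trade-off is exactly the one listed above: the map is built in a single pass, but keeping every target-to-referrers mapping in memory is expensive for a dump of this size.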

Statistics

There are 869719 pages in the trwiki dump of 2019-04-01.

  • Map construction takes about 196.45 seconds. (This has to be done only once, to load the XML dump into memory.)
  • One vdt link search takes about 1.89 seconds. (We have 70467 vdt links.)
  • At that rate, the whole job can be completed in approximately 38 hours.
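For reference, this is the back-of-the-envelope arithmetic behind the estimate above, using the timings from this page (the one-time map construction is negligible compared to the searches):

    # Rough check of the estimate above, using the timings on this page.
    map_construction_s = 196.45       # one-time cost
    seconds_per_vdt_search = 1.89     # per-link search time
    vdt_count = 70467

    total_hours = (map_construction_s + vdt_count * seconds_per_vdt_search) / 3600
    print(round(total_hours, 1))      # roughly 37 hours, in line with the ~38-hour figure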