Findings & Statistics
We are extracting the Wikipedia pages from the dump trwiki-20190401-pages-articles-multistream.xml.bz2, which contains 869719 pages in total. We use a module called mwxml to parse the XML file and iterate over the pages.
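A minimal sketch of streaming the dump with mwxml; the variable names and exact loop are illustrative, not the project's actual code:

```python
import bz2

import mwxml

DUMP_PATH = "trwiki-20190401-pages-articles-multistream.xml.bz2"

# Stream the compressed dump page by page instead of loading it all at once.
dump = mwxml.Dump.from_file(bz2.open(DUMP_PATH))

page_count = 0
for page in dump:
    # pages-articles dumps carry a single (latest) revision per page
    for revision in page:
        wikitext = revision.text or ""
    page_count += 1

print(page_count)  # expected to end up at 869719 for this dump
```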
The XML dump does not contain information about referencing pages (backlinks), so we need another way to obtain the list of pages that link to a given page. There are several approaches I can think of:
- Traverse all the pages to find the links to a given page on demand. (Time inefficient per query, although a single pass does not actually take that long.)
- Traverse all pages once and save the links in a map. (Memory inefficient.)
- Fetch the referencing pages from Wikipedia with an HTTP request. (Network latency.)
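As a concrete illustration of the second option, here is a minimal sketch that builds a backlink map from the dump. It assumes plain [[Target|label]] wikitext links and ignores namespaces, redirects and templates, so it is a simplification rather than the project's actual implementation:

```python
import bz2
import re
from collections import defaultdict

import mwxml

DUMP_PATH = "trwiki-20190401-pages-articles-multistream.xml.bz2"

# Capture the target part of [[Target]], [[Target|label]] or [[Target#section]].
LINK_RE = re.compile(r"\[\[([^\]\|#]+)")

# backlinks["Ankara"] -> titles of all pages whose text links to "Ankara"
backlinks = defaultdict(set)

for page in mwxml.Dump.from_file(bz2.open(DUMP_PATH)):
    for revision in page:
        for target in LINK_RE.findall(revision.text or ""):
            backlinks[target.strip()].add(page.title)
```

Keeping every link in memory is exactly the memory cost mentioned above, but once the map is built, each backlink lookup is a single dictionary access.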
Measurements on the trwiki 2019-04-01 dump (869719 pages in total):
- Map construction takes 196.45 seconds. (This only needs to be done once, to load the pages of the XML dump into an in-memory map.)
- A single vdt link search takes 1.89 seconds. (There are 70467 vdt in total.)
- At that rate, searching all of them takes roughly 70467 × 1.89 s ≈ 133,000 s, so we can be done in approximately 37 hours.
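The runtime estimate above is just the two measured numbers multiplied together; a quick check:

```python
SECONDS_PER_SEARCH = 1.89  # measured single vdt link search time (rounded)
VDT_COUNT = 70467          # number of vdt in the dump

total_seconds = SECONDS_PER_SEARCH * VDT_COUNT
print(f"{total_seconds:.0f} s ≈ {total_seconds / 3600:.1f} hours")  # 133183 s ≈ 37.0 hours
```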