Scripts for processing Wikipedia dumps (in any language) and extracting useful metadata from them (inter-language links, how often a string refers to a Wikipedia page, etc.).
Install the requirements, modify the makefile appropriately, and run.
This repository contains scripts to perform the following preprocessing steps.
- Download the relevant files from the Wikipedia dump (target dumps in makefile). Specifically, it downloads
  - *-pages-articles.xml.bz2
  - *-page.sql.gz
  - *-pagelinks.sql.gz
  - *-redirect.sql.gz
  - *-categorylinks.sql.gz
  - *-langlinks.sql.gz
- Extract text with hyperlinks from the *pages-articles.xml.bz2 file (target text in makefile), using the wikiextractor.
- Create an inter-language link mapping from Wikipedia titles to English Wikipedia titles using *langlinks.sql.gz (target langlinks in makefile). Inter-language links indicate, for example, that the page Barack_Obama in English Wikipedia is for the same entity as the page बराक_ओबामा in Hindi Wikipedia.
- Compute hyperlink counts (how many hyperlinks point to a certain title) for Wikipedia titles (target countsmap in makefile). These are the inlink counts for each title.
- Compute probability indices from which we can compute the probability of a string (e.g., Berlin) referring to a Wikipedia title (e.g., Berlin_(Band)) (target probmap in makefile).
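The last two steps can be illustrated with a minimal sketch. It assumes the hyperlinks have already been reduced to (surface string, linked title) pairs; the function names and the sample pairs are hypothetical, and the repository's own scripts may compute and store these quantities differently.

```python
from collections import Counter, defaultdict


def inlink_counts(links):
    """Count how many hyperlinks point to each title."""
    return Counter(title for _surface, title in links)


def string_to_title_probs(links):
    """Estimate P(title | surface string) by relative frequency."""
    pair_counts = Counter(links)
    surface_totals = Counter(surface for surface, _title in links)
    probs = defaultdict(dict)
    for (surface, title), n in pair_counts.items():
        probs[surface][title] = n / surface_totals[surface]
    return dict(probs)


# Hypothetical (surface string, linked title) pairs, made up for illustration.
links = [
    ("Berlin", "Berlin"),
    ("Berlin", "Berlin"),
    ("Berlin", "Berlin"),
    ("Berlin", "Berlin_(Band)"),
    ("the band", "Berlin_(Band)"),
]

counts = inlink_counts(links)
probs = string_to_title_probs(links)
print(counts["Berlin_(Band)"])           # 2
print(probs["Berlin"]["Berlin_(Band)"])  # 0.25
```

Swapping the roles of surface and title in the second function would give the reverse direction, P(surface string | title), under the same assumptions.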
Major output files are explained below:
Creates a map from Wikipedia page id to page title using *page.sql.gz (target id2title in makefile). The result is saved in ${OUTDIR}/${lang}wiki/idmap/${lang}wiki-data.id2t.
Every Wikipedia page is associated with a unique page id. For instance, the page Barack_Obama in the English Wikipedia has the page id 534366. You can verify this by visiting https://en.wikipedia.org/?curid=534366 or visiting the page information link on the Tools panel on the left on the Wikipedia page. This page id serves as the canonical identifier of the page, and is used in other dump files (e.g., enwiki-*-redirect.sql.gz etc.) to refer to the page.
The output map is a tsv file that looks like this (example from Turkish wiki dump for 20181020):
10 Cengiz_Han 0
16 Film_(anlam_ayrımı) 0
22 Mustafa_Suphi 0
24 Linux 0
25 MHP 1
Each line represents an entry for one page, where the first field is the page id, the second field is the page title, and the third field is a boolean (0 or 1) indicating whether the page is a redirect.
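As a rough sketch of how this map could be read back in, the snippet below parses tab-separated rows into a dictionary. The function name load_id2title is hypothetical, and it assumes the redirect flag is written as the literal string 0 or 1, as in the example above.

```python
import io


def load_id2title(lines):
    """Parse id2t rows into {page_id: (title, is_redirect)}."""
    id2title = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        page_id, title, is_redirect = fields
        id2title[int(page_id)] = (title, is_redirect == "1")
    return id2title


# Rows taken from the Turkish example above, with explicit tabs.
sample = io.StringIO("10\tCengiz_Han\t0\n24\tLinux\t0\n25\tMHP\t1\n")
id2title = load_id2title(sample)
print(id2title[25])  # ('MHP', True)
```

For a real file, pass an open file handle instead of the in-memory sample, for example open(path, encoding="utf-8").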
In this preprocessing step, for each dumped wiki file, we create two json files that summarize the information for each page: wiki_{no.}.json and wiki_{no.}.json.brief.
The processed json files are saved in ${OUTDIR}/${lang}link_in_pages.
This information is later used to create the training dataset.
In wiki_{no.}.json files, for each wiki page, we store:
title: the page title
curid: the wikipage id
text: the raw text of this page, with all hyperlinks removed
linked_spans: a list of all the words in this page that have an outlink to some other page.
For each of those words, we record its starting and ending character positions.
An example from part of a Turkish wiki.json file (from the 2019/05/01 dump):
{
    "title": "Kimya",
    "curid": "58",
    "text": "\nKimya\n\nKimya, maddenin yap......",
    "linked_spans": [
        {
            "label": "Madde",
            "end": 20,
            "start": 15
        },
        {
            "label": "'Kimyasal_reaksiyon'",
            "end": 86,
            "start": 79
        }, ...
    ]
}
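To make the offsets concrete, here is a small sketch that slices the page text with each span's start and end. It uses only the truncated text and the first span from the example above (the second span lies beyond the truncated text). The function name mention_strings is hypothetical, and treating end as exclusive is an assumption that happens to match this example.

```python
def mention_strings(record):
    """Return (surface text, label) pairs for every linked span."""
    text = record["text"]
    return [
        (text[span["start"]:span["end"]], span["label"])
        for span in record["linked_spans"]
    ]


# Truncated record from the example above; only the first span fits
# inside the visible part of the text.
record = {
    "title": "Kimya",
    "curid": "58",
    "text": "\nKimya\n\nKimya, maddenin yap",
    "linked_spans": [{"label": "Madde", "start": 15, "end": 20}],
}

print(mention_strings(record))  # [('madde', 'Madde')]
```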
The wiki_{no.}.json.brief file contains only the curid, title and raw text. An example of the same wikipage as above:
{
    "title": "Kimya",
    "curid": "58",
    "text": "\nKimya\n\nKimya, maddenin yap......"
}
The xling-el project for cross-lingual entity linking requires training data to be provided in a certain format. Generating this data from Wikipedia text is handled by the mid target in the makefile. The training data is a tab-separated file with the following fields:
a. The freebase mid of the wikipedia page.
b. The wikipedia page title.
c. Start token offset of the mention.
d. End token offset of the mention.
e. The mention string.
f. The context around (and including) the mention, of a certain window size.
g. All other mentions in the same document as the current mention.
The output tab-separated files are saved in ${OUTDIR}/${lang}mid.
Here is a line of example output from Turkish wiki (from 2018/11/01 dump):
MID 163500 Krokau 4 4 Almanya Almanya Schleswig-Holstein Plön_(il) Almanya'nın_belediyeleri 31_Aralık 2015
The tab-separated fields are, from left to right:
- MID keyword
- Page ID of the Wiki page
- Normalized page title
- Start index of the tokens that contain a mention
- End index of the tokens that contain a mention
- The context for the mention. It contains n characters before and after the mention, where n is the window size.
- All mentions in the same page.
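The idea of a context window around a mention can be sketched as follows. This is an illustration only: the function name context_window and the sample tokens are hypothetical, the window is measured in tokens here (the text above describes it in characters), and the end offset is assumed to be inclusive because the example line uses 4 and 4 for a single-token mention.

```python
def context_window(tokens, start, end, window):
    """Return the mention tokens plus up to `window` tokens on each side.

    `start` and `end` are token offsets, with `end` assumed inclusive.
    """
    lo = max(0, start - window)
    hi = min(len(tokens), end + 1 + window)
    return tokens[lo:hi]


# Hypothetical tokenized page with a one-token mention at offset 4.
tokens = ["t0", "t1", "t2", "t3", "Almanya", "t5", "t6", "t7"]
print(context_window(tokens, 4, 4, 2))
# ['t2', 't3', 'Almanya', 't5', 't6']
```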
Creates a redirects map using *redirect.sql.gz (target redirects in makefile).
Redirects tell you, for example, that the Wikipedia link POTUS44 redirects to the page Barack_Obama in the English Wikipedia.
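A minimal sketch of how such a map might be used to normalize a link target is shown below. It assumes the map has been loaded into a plain dictionary from redirect title to target title; the function name resolve_redirect is hypothetical, and whether the generated map already collapses redirect chains is not stated here, so the loop follows chains defensively.

```python
def resolve_redirect(title, redirects, max_hops=10):
    """Follow redirects until a non-redirect title (or a cycle) is reached."""
    seen = set()
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title


# Single entry taken from the example above.
redirects = {"POTUS44": "Barack_Obama"}
print(resolve_redirect("POTUS44", redirects))       # Barack_Obama
print(resolve_redirect("Barack_Obama", redirects))  # Barack_Obama
```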
You need python >=3.5. Also install the following two packages.
pip3 install bs4
pip3 install hanziconv # (for chinese traditional to simplified conversion)
For ease of use, we provide a makefile that specifies targets to automatically run all processing scripts. To use the makefile, you need to
- Download/clone wikiextractor. Modify the path WIKIEXTRACTOR in the makefile to point to it.
- Create a download directory for Wikipedia dumps (say /path/to/dumpdir) and set DUMPDIR accordingly. The Wikipedia dumps will be downloaded under DUMPDIR (for instance, the Turkish Wikipedia dumps will be downloaded under DUMPDIR/trwiki/).
  For Cogcomp Internal Use: Wikipedia dumps are already available under /shared/corpora/wikipedia_dumps, so simply set DUMPDIR to /shared/corpora/wikipedia_dumps. For instance, the Turkish Wikipedia resources are in /shared/corpora/wikipedia_dumps/trwiki.
- Set the lang variable to the two-letter language code used by Wikipedia to identify the language (e.g., tr for Turkish, es for Spanish, etc.).
- Specify an OUTDIR. This is the directory where the resources will be generated (e.g., path/to/my/resources/trwiki for Turkish Wikipedia). To keep the code generic, you may want to use the lang variable to define the OUTDIR (e.g., path/to/my/resources/${lang}wiki).
- Modify the DATE variable to identify the timestamp of the Wikipedia dump to download. Make sure that this link works: https://dumps.wikimedia.org/${lang}wiki/${DATE}/
- Make sure PYTHONBIN points to the correct Python binary.
- Run the command make all. This should perform all the preprocessing steps above by following the build dependencies specified in the makefile.
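Before running the makefile, the dump link for a given lang and DATE can be checked with a short script. The sketch below builds the URL from the pattern given above and issues a HEAD request with the standard library; the function names dump_url and url_is_reachable are hypothetical helpers, not part of this repository.

```python
import urllib.error
import urllib.request


def dump_url(lang, date):
    """Build the dump index URL for a language code and dump date."""
    return "https://dumps.wikimedia.org/{}wiki/{}/".format(lang, date)


def url_is_reachable(url, timeout=10):
    """Return True if a HEAD request to `url` succeeds, False otherwise."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        return False


print(dump_url("tr", "20180720"))
# https://dumps.wikimedia.org/trwiki/20180720/
```

Calling url_is_reachable(dump_url(lang, date)) with your own values gives a quick yes/no answer. Older dumps are removed from the server over time, so a False result for an old DATE may simply mean that dump is no longer hosted.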
After make all completes successfully (it takes ~18 minutes on a single-core machine for Turkish Wikipedia), you should have files with the following line counts (for the 20180720 dump of Turkish Wikipedia):
222367 idmap/fr2entitles
559553 idmap/trwiki-20180720.id2t
247338 idmap/trwiki-20180720.r2t
559552 trwiki-20180720.counts
2941652 surface_links
936100 probmap/trwiki-20180720.p2t2prob
936100 probmap/trwiki-20180720.t2p2prob
1426771 probmap/trwiki-20180720.t2w2prob
745829 probmap/trwiki-20180720.tnr.p2t2prob
745829 probmap/trwiki-20180720.tnr.t2p2prob
1273216 probmap/trwiki-20180720.tnr.t2w2prob
1273216 probmap/trwiki-20180720.tnr.w2t2prob
1426771 probmap/trwiki-20180720.w2t2prob
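One way to compare your own output against these numbers is a small line-counting script. The sketch below is an assumption-laden helper, not part of the repository: the function names are hypothetical, paths are taken relative to your OUTDIR, and only two of the expected counts from the listing above are included for brevity.

```python
import os


def count_lines(path):
    """Count lines in a file without loading it into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def check_counts(outdir, expected):
    """Return {relative_path: (expected, actual)} for every mismatch.

    `actual` is None when the file does not exist.
    """
    mismatches = {}
    for rel_path, expected_count in expected.items():
        full_path = os.path.join(outdir, rel_path)
        actual = count_lines(full_path) if os.path.exists(full_path) else None
        if actual != expected_count:
            mismatches[rel_path] = (expected_count, actual)
    return mismatches


# Two of the expected counts from the listing above.
expected = {
    "idmap/trwiki-20180720.id2t": 559553,
    "trwiki-20180720.counts": 559552,
}
```

Running check_counts("path/to/my/resources/trwiki", expected) returns an empty dictionary when every listed file matches. Counts for a different dump date will naturally differ from the ones shown here.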
If you use this code, please cite
@inproceedings{UGR18,
author = {Upadhyay, Shyam and Gupta, Nitish and Roth, Dan},
title = {Joint Multilingual Supervision for Cross-lingual Entity Linking},
booktitle = {EMNLP},
year = {2018}
}