
The Power of Countries on the Web

Xiaopeng Li edited this page Aug 31, 2014 · 1 revision

Welcome to the naward06 wiki!

We are a team of three master's students from TU Delft participating in the 2014 Norvig Web Data Science Award.

IDEA

One of our team members, Xinyang, once visited the Trento Museum, where he saw an interesting visualization showing how often each country was mentioned in Wikipedia from 1800 to 2013. It clearly shows that China was rarely mentioned before 1970, after which its frequency of mention started to soar. This experience became the source of inspiration for this work. We decided to base our project on it and look for meaningful results within the scope of this competition. Fortunately, we have access to the enormous number of web pages in Common Crawl, which gives us a great opportunity to mine data of interest.

In brief, the idea of our work is to demonstrate the impact of countries in the world by counting the number of times each country is mentioned in media published on the web (online newspapers, articles, social media posts, etc.). We believe that, besides conventional indicators such as economic power, often expressed as GDP, the frequency with which a country appears on the web should also reflect its impact on the current world.

METHOD

In short, we mine relevant data from Common Crawl, conduct a semantic analysis, and finally visualize the statistical results.

The data is extracted from the Common Crawl data provided on SURFsara. Before extracting it, we perform some pre-processing.

First, since the Web is worldwide and multilingual, the name of a country differs from language to language. For example, China is "China" in English, "Chine" in French, and "中国" in Chinese. It is therefore essential to count the occurrences of a country's name in all of these forms and sum them up.

We first performed a semantic analysis using the Geopolitical Ontology, which yielded the following file:

"United States", "Americas", "China", "the People's Republic of China", "中国", "Japan", "日本", "the Federal Republic of Germany", "France", "Repubblica francese",

This file lists all the synonyms of each country.

Second, Pig is used to count how often the name of each country is mentioned on the Web. By filtering out irrelevant web pages, counting the occurrences of the country names in our file, and summing them per country, we obtain the word count of each country in the (web) world, partly shown below:

1,America,2740353
2,China,456219
3,Japan,406982
5,France,913474
7,Brazil,194354
9,Italy,476817
10,India,1606035
11,Canada,500419
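The counting step described above can be sketched in Python. This is not the Pig script used in the project; it is only a minimal illustration of the idea, assuming a hypothetical alias table (`ALIASES`) built from a few entries of the synonym file and a hypothetical list of page texts. Longer aliases are tried first so that "the People's Republic of China" is not also counted as "China".

```python
import re
from collections import Counter

# Hypothetical alias table: each alias maps to one canonical country name.
# Only a few entries from the synonym file are shown; a real table would
# cover every country and every language.
ALIASES = {
    "China": "China",
    "the People's Republic of China": "China",
    "中国": "China",
    "Japan": "Japan",
    "日本": "Japan",
    "France": "France",
}


def build_pattern(aliases):
    """Compile one regex that matches any alias, longest alias first."""
    ordered = sorted(aliases, key=len, reverse=True)
    return re.compile("|".join(re.escape(alias) for alias in ordered))


def count_countries(pages, aliases=ALIASES):
    """Sum the occurrences of all aliases per canonical country name.

    Matching is case-sensitive and does not check word boundaries, so
    "Chinatown" would also count towards China in this simplified sketch.
    """
    pattern = build_pattern(aliases)
    totals = Counter()
    for text in pages:
        for match in pattern.finditer(text):
            totals[aliases[match.group(0)]] += 1
    return totals


if __name__ == "__main__":
    sample_pages = [
        "China and Japan signed a deal.",
        "中国与日本",
        "the People's Republic of China",
    ]
    for country, total in count_countries(sample_pages).most_common():
        print(country, total)
```

Mapping every alias to a canonical name before summing is what lets mentions in different languages contribute to the same country total.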

We assume that the more often a country's name occurs, the greater its impact on the world.

Third, we validate our assumption, i.e. whether the number of occurrences of a country's name on the Web truly reflects its power. To do so, we compare our counts with a conventional benchmark, namely a country's GDP, which is a widely acknowledged indicator of a country's economic strength. Using Python for further processing, we collect each country's GDP statistics for 2013 as the point of comparison for our proposed method.
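The project compares the two measures visually. One possible way to put a number on such a comparison, which is an assumption on our side and not something the project computed, is a Spearman rank correlation between name counts and GDP. The sketch below uses only the standard library; the function names `ranks` and `spearman` are illustrative.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = ranks(xs), ranks(ys)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

Called as `spearman(name_counts, gdp_values)` with the two lists in the same country order, a value close to 1 would mean that countries ranked high by name count are also ranked high by GDP.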

Lastly, to show our results clearly and vividly, we use the JavaScript library d3.js to visualize the web occurrence (word counts) and GDP of each country in the world; the visualization is shown on our blog, visual4world.

RESULTS

The result actually exceeds our expectations. We visualize the word counts obtained in the previous steps and each country's 2013 GDP in the same chart. The size of the green filled circle represents a country's GDP, while the hollow circle with a blue stroke represents the number of occurrences of the country's name on the Web. The values of the two parameters are normalized to a common scale, meaning that the circle for the largest GDP has the same size as the circle for the largest name occurrence, and the same holds for the smallest ones. In this way, a country's position relative to all other countries can be shown on both measures. From the chart, we can see that for most countries, media impact on the web matches economic power as measured by GDP: the filled and hollow circles largely overlap. However, exceptions do exist. For instance, China's economic position is clearly higher than its position in web media, which, in our personal view, might be because Western countries currently dominate the discourse.
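The normalization described above, in which the largest and smallest values of both measures map to the same circle sizes, can be sketched as a min-max scaling. This is only an illustration: the function name `scale_to_range` and the radius range of 5 to 50 are assumptions, and the text does not say whether the project scaled the radius or the area of the circles.

```python
def scale_to_range(values, out_min, out_max):
    """Linearly map values so that min(values) -> out_min and max(values) -> out_max."""
    low, high = min(values), max(values)
    if high == low:
        # All values are equal: place them in the middle of the output range.
        return [(out_min + out_max) / 2 for _ in values]
    span = out_max - out_min
    return [out_min + (v - low) * span / (high - low) for v in values]


if __name__ == "__main__":
    # Word counts from the partial result listed in the METHOD section.
    names = ["America", "China", "Japan", "France",
             "Brazil", "Italy", "India", "Canada"]
    counts = [2740353, 456219, 406982, 913474,
              194354, 476817, 1606035, 500419]
    # Hypothetical radius range; GDP values would be passed through the
    # same function so that both measures share one scale.
    for name, radius in zip(names, scale_to_range(counts, 5, 50)):
        print(f"{name}: {radius:.1f}")
```

Because each dataset is scaled by its own minimum and maximum, the circles express a country's relative position within each measure, not absolute values.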

DISCUSSION

By making good use of data from SURFsara, we managed to turn an interesting museum experience into a meaningful piece of work. The statistics of country name occurrence on the web, which we interpret as a country's impact on web media, are supported by the corresponding GDP data, one of the most important indicators of world power. Hence, we conclude that how often a country's name occurs on the web can indicate its impact on the world. Overall, we regard this work as successful and promising, although we admit that several aspects of the experiment can be improved. First, during data mining, only the variants of a country's name across languages were taken into account, while possible aliases or nicknames were ignored. This might bias the name occurrence statistics. Second, the way we normalize the two datasets before visualization needs further validation. Third, the results contain some exceptions for which name occurrence does not match economic power. Further investigation is needed to find the cause of these exceptions. Last but not least, we should demonstrate with evidence that the economic power indicator, i.e. GDP, is a suitable choice for validating our results.

Our work is available at: visual4world

By Chen Wang, Xiaopeng Li and Xinyang Gao from Delft University of Technology
