A Story on Statistics

Delve deeper into the principles of statistics

Posted on: January 2023


Humans have survived to this day because we are curious about nature. Nature, in this context, covers not only the natural sciences such as physics and chemistry, but also natural history and human activity, from DNA to economics.

To understand nature, humans try to model almost everything about it, from mathematical models such as Newton’s laws of motion to the 3D model a designer makes of a house they are going to build. A model does not need to be expressed mathematically. A process can also be modeled in language: for example, “The sun rises in the east and sets in the west” is a model of how the sun moves every day. In fact, language itself is a model of how we think (let's talk about this sometime in the future).

Newton’s laws of motion are indeed a magnificent model that revolutionized industry in their time. But the problem with most models, including Newton's laws, is that they are too idealized. For example, let's say you want to travel from Jakarta to Surabaya. If your car travels at a constant speed of 60 km/h, then a simple linear model (distance = speed × time) says that two hours of travel gets you 120 km closer to Surabaya, and four hours gets you 240 km closer. In that case you could predict exactly when you will arrive in Surabaya. Unfortunately, if you have ever traveled from one city to another, you know this is not the case. You will never be able to predict exactly when you will arrive in Surabaya. Maybe you get stuck in a traffic jam somewhere in Semarang, or there are some roadworks in Malang.
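The idealized model above fits in a few lines of Python. This is only a sketch of the constant-speed assumption from the example; the function name `distance_covered` is made up for illustration.

```python
SPEED_KMH = 60.0  # constant speed assumed in the example above


def distance_covered(hours: float, speed_kmh: float = SPEED_KMH) -> float:
    """Distance in km covered after `hours` of driving at a constant speed."""
    return speed_kmh * hours


print(distance_covered(2))  # 120.0
print(distance_covered(4))  # 240.0
```

Notice that the model has no room for traffic jams or roadworks: the same input always gives the same output.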

The reason we cannot predict natural processes with certainty is variability. Nature cannot escape variability, so your simple linear model is wrong. But don't worry about your trip: you can still use the model even though it is wrong. That is because all models are wrong, but some models are better than others.

So how do we make our models better than other models? The answer is observation. By observing a process we get data. We then use statistics to tell stories about the behavior of the process. Statistics is the way to deal with variability. For example, to quantify the variability of the travel time from Jakarta to Surabaya, we can look up data on the same trip. Maybe a friend traveled from Jakarta to Surabaya last week and arrived in 9 hours 38 minutes. Another friend has made the trip multiple times and says it takes about 10 hours 11 minutes. Based on these observations, a rough first guess is that you will arrive in somewhere between 9 hours 38 minutes and 10 hours 11 minutes, although two observations cannot guarantee that your own trip will fall inside that range. With more observations, you will arrive at a better guess.
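As a small sketch, the two reported travel times can be turned into numbers and summarized. The helper `to_minutes` is a hypothetical name, and the range of two observations is only a crude first description of variability.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert a duration given as hours and minutes into total minutes."""
    return hours * 60 + minutes


# The two travel times reported in the text.
observations = [to_minutes(9, 38), to_minutes(10, 11)]

low, high = min(observations), max(observations)
spread = high - low

print(low, high)  # 578 611
print(spread)     # 33
```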

Let's delve deeper into the principles of statistics. In their book, Peter Westfall and Kevin Henning say that there are three ‘mantras’ of statistics that we need to know. I like to call these mantras the principles of statistics.

The first principle is "model produces data, and not vice versa". This sounds obvious at first, but many people do not apply this principle when they do statistical analysis, and their conclusions suffer for it. What we do with statistics is build an artificial model that produces data like the data the true model produces. We are not trying to create a model from the data alone. That is what many people do with statistics, and it is wrong: they assume that the data they observed are the end point. The truth is that our observed data are just one outcome of the true process and may be the victim of variability. One way to deal with this is to report not only the mean but also a confidence interval for the mean.
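To make "model produces data" concrete, here is a minimal sketch using only the Python standard library. The "true model" is an assumption made up for this illustration: travel times are drawn from a normal distribution with a mean of 600 minutes and a standard deviation of 30 minutes. The interval uses a normal approximation, which is rough for small samples.

```python
import random
import statistics

# Hypothetical "true model" parameters, invented for illustration only.
TRUE_MEAN_MIN = 600.0
TRUE_SD_MIN = 30.0


def simulate_trips(n, rng):
    """The model produces data: draw n travel times (in minutes)."""
    return [rng.gauss(TRUE_MEAN_MIN, TRUE_SD_MIN) for _ in range(n)]


def mean_confidence_interval(data, level=0.95):
    """Sample mean with a normal-approximation confidence interval."""
    n = len(data)
    mean = statistics.fmean(data)
    sem = statistics.stdev(data) / n ** 0.5  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean, mean - z * sem, mean + z * sem


rng = random.Random(42)  # fixed seed so the run is repeatable
trips = simulate_trips(30, rng)
mean, low, high = mean_confidence_interval(trips)
print(f"mean = {mean:.1f} min, 95% interval = ({low:.1f}, {high:.1f})")
```

Running it with a different seed gives different data and a different interval, even though the model behind the data never changed. That is the sense in which the observed data are not the end point.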

The second principle is "model has unknown parameters". We try to make our artificial model match the true process model, but we will never know the true process model with 100% accuracy. No matter how many observations we make, we will never be able to model the time it takes to travel from Jakarta to Surabaya with 100% accuracy. This is why I said earlier that all models are wrong.

But don't worry, because the last principle of statistics is "data reduces the uncertainty of the parameters". We will never know the true process model, but by observing how the true process behaves, we can reduce our uncertainty about it. We will never know exactly how many hours it takes to travel from Jakarta to Surabaya, but we can collect data and state an interval for the time it takes.
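A short simulation can illustrate how more data narrows the interval. As before, this is a sketch under made-up assumptions: the "true" travel time is taken to be normally distributed with a mean of 600 minutes and a standard deviation of 30 minutes, and the interval is a normal-approximation 95% confidence interval for the mean.

```python
import random
import statistics

# Hypothetical "true model" parameters, invented for illustration only.
TRUE_MEAN_MIN = 600.0
TRUE_SD_MIN = 30.0


def interval_width(n, rng, level=0.95):
    """Width of the confidence interval for the mean from n simulated trips."""
    data = [rng.gauss(TRUE_MEAN_MIN, TRUE_SD_MIN) for _ in range(n)]
    sem = statistics.stdev(data) / n ** 0.5  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return 2 * z * sem


rng = random.Random(2023)  # fixed seed so the run is repeatable
widths = {n: interval_width(n, rng) for n in (10, 100, 1000)}
for n, width in widths.items():
    print(f"{n:>5} trips -> interval width of about {width:.1f} minutes")
```

The width shrinks roughly in proportion to one over the square root of the number of trips, so it gets smaller and smaller but never reaches zero. The parameter stays unknown; we are only less uncertain about it.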

This is the bread and butter of statistics. If you understand these principles, it will be easier for you to understand most statistical tools. People often have the misconception that data is the holy grail of statistics. They think that if the data say so, then it must be true. This is clearly false. What statistics tries to do is create an artificial model that mimics the true model. We don't know the true model, but by observing the process we can bring our artificial model as close as possible to it.

Further Reading

  • ‘Understanding Advanced Statistical Methods’ by Peter H. Westfall and Kevin S. S. Henning.