Guide to Statistical Modelling
Introduction – What are Statistical Models?
Statistical models are extremely useful and are widely used in science and engineering. In many cases a purely deterministic model is not sufficient, so a probabilistic model is used instead. With such a model it becomes possible to answer questions such as "What is the probability that X events will happen within a certain time?" and "Within which range does the unknown value lie with 95% certainty?".
All these questions, and many more, can be answered with the help of statistical models.
Concepts in Statistical Modelling
A model is usually built from data collected by observing or measuring some event. The model is then typically constructed to have a mean, together with a variance that accounts for the natural variation of events in real life. A simple example is the temperature for each day of the year. The actual temperature on a given date will differ from year to year, but we also know that temperatures on the same date are strongly correlated. This makes it a good scenario for a statistical model: we could give each day a mean temperature, together with a variation characterized by its expected variance. The variation itself could be modeled with many different kinds of random variables, but one common choice is the Gaussian distribution. Gaussian random variables often model real-world behavior well, and even when a quantity is not exactly Gaussian, the assumption may still be a good, or at least useful, one to make. In addition, there are several powerful methods and tools for analyzing Gaussian random variables.
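The mean-plus-Gaussian-variation idea can be sketched in a few lines of Python. All numbers here are hypothetical, chosen only to illustrate the temperature example:

```python
import random

# A minimal sketch of a Gaussian temperature model (all numbers hypothetical):
# each day has a mean temperature, and the observed value varies around it
# with a fixed standard deviation.
mean_temp_july_1 = 22.0   # hypothetical mean for 1 July, in degrees Celsius
std_dev = 3.0             # hypothetical year-to-year standard deviation

random.seed(0)  # fixed seed so the sketch is repeatable
# Simulate the 1 July temperature over ten different years.
samples = [random.gauss(mean_temp_july_1, std_dev) for _ in range(10)]
average = sum(samples) / len(samples)
print(round(average, 1))  # close to the 22.0 mean, but not exactly equal
```

Each simulated year differs, yet the sample average stays near the model's mean, which is exactly the behavior the mean/variance description is meant to capture.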
Regression analysis is a simple, often computer-aided, way to create models from observed data. One of its best features is that the user does not have to think too much about how to construct the model; the computer calculates it and returns an expression. However, it is always important to examine the expression you are given and check whether it is reasonable. For example, your measured data might show a strong correlation between two variables, yet this could be a spurious result caused by a lack of causality or by too few measurements.
The basic idea is that you are trying to find some connection, or correlation, between two variables. For example, the size of a house could be one variable and its sale price the other. We expect to see a correlation between the size of a house and its price, so it might be possible to design a model that approximates the price of a new house of a given size. For a case like this, a good first step is to plot the historical data against each other, with size on one axis and price on the other.
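Alongside the plot, the strength of the connection can be quantified numerically. A sketch in Python with NumPy, using hypothetical size/price data (the original Matlab data set is not available):

```python
import numpy as np

# Hypothetical historical data: house sizes (m^2) and sale prices (thousand USD).
sizes  = np.array([ 60,  75,  90, 110, 130, 150, 170, 190], dtype=float)
prices = np.array([ 45,  55,  80,  95, 120, 150, 160, 195], dtype=float)

# Before fitting any model, quantify the connection with the sample
# correlation coefficient (and, in practice, plot the points as well).
r = np.corrcoef(sizes, prices)[0, 1]
print(round(r, 2))  # a value near 1 indicates a strong linear relationship
```

A high correlation coefficient supports trying a linear model first, but, as noted above, correlation alone does not establish causality.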
From the figure, we can indeed see some connection between these two variables, as expected. We can continue by using the most basic model, a linear model:
Price = A*(Size) + B
where A and B are constant scalars, and the fit can be computed with the help of Matlab.
With Matlab we obtain A = 1.2 and B = -33. We now need to consider several things. First, it is always a good idea to verify that the result is reasonable. In this case we do believe that the price, at least on average, should rise with increasing size, so the model makes sense to some degree. The second thing to consider is the region in which the model makes sense, in other words, where it is valid. Would our model be valid if the size of the house were, say, only 10 square meters? We would then have Price = 1.2*10 - 33 = 12 - 33 = -21 thousand USD, a negative price! It is highly unlikely that someone would pay you to take over their house, so we can confirm that the model is not valid over all ranges. We can, however, say that it is probably acceptable over the range where we have historical data.
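The same least-squares fit can be sketched in Python with NumPy. The data below is hypothetical, so the fitted coefficients will differ from the Matlab values in the text, but the validity check at 10 square meters works the same way:

```python
import numpy as np

# Hypothetical data (the A = 1.2, B = -33 values in the text come from
# Matlab on the original data set, which we do not have).
sizes  = np.array([ 60,  75,  90, 110, 130, 150, 170, 190], dtype=float)
prices = np.array([ 45,  55,  80,  95, 120, 150, 160, 195], dtype=float)

# Least-squares fit of Price = A*Size + B.
A, B = np.polyfit(sizes, prices, 1)
print(round(A, 2), round(B, 1))

# Sanity check outside the data range: a 10 m^2 "house".
print(round(A * 10 + B, 1))  # negative here, i.e. the model is invalid in that region
```

Evaluating the fitted line far outside the range of the data reproduces the negative-price problem discussed above, which is why the validity range always needs to be checked.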
The next question is: what other kinds of models can we use? Another idea is a polynomial model. If we assume that the price grows with the square of the size, we would have
Price = A*(Size)^2 + B*(Size) + C
where A, B and C are scalars, and the fit can again be computed in Matlab.
This model does seem to follow the data a bit more closely, and in many cases we could assume that it is the better model, that it "fits better". However, a dangerous phenomenon that can occur is overfitting: describing a simple relationship with too much complexity in the hunt for a "better fit", which can sometimes lead to unrealistic results. Suppose we describe the price with a degree-10 polynomial, so that the highest term is (Size)^10, and fit it to the same data.
We can see that this model does not make sense: beyond 150 square meters the price goes down, and the same occurs at 180 square meters. This counter-intuitive behavior is caused by overfitting. Strictly mathematically, this model is better in the sense that it passes closer to all the points than the linear and quadratic models, but it does not always give a good estimate of the price between the historical data points.
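The overfitting effect can be reproduced with a small Python/NumPy sketch. With only 8 hypothetical data points, a degree-7 polynomial already passes through every point exactly, the extreme case of fitting the points rather than the trend (the text's Matlab example uses degree 10 on its own data set):

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical data, as in the earlier sketches.
sizes  = np.array([ 60,  75,  90, 110, 130, 150, 170, 190], dtype=float)
prices = np.array([ 45,  55,  80,  95, 120, 150, 160, 195], dtype=float)

# Polynomial.fit rescales the domain internally, which keeps the
# high-degree fit numerically well conditioned.
p_lin  = Polynomial.fit(sizes, prices, 1)   # simple linear model
p_over = Polynomial.fit(sizes, prices, 7)   # degree 7 on 8 points: interpolation

resid_lin  = float(np.max(np.abs(p_lin(sizes) - prices)))
resid_over = float(np.max(np.abs(p_over(sizes) - prices)))
print(resid_over < resid_lin)  # True: the complex model fits the points better
```

The high-degree model wins on residuals at the data points, yet, as discussed above, that says nothing about how sensibly it behaves between or beyond them.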
Hence, the difficulty is often to find a model complex enough to describe all the features of the data, all the natural variation, without so much complexity that overfitting occurs. It is a trade-off between oversimplifying and making the model overly complex.
Here we only discussed polynomial models, but there are several other kinds that might be useful, for example logarithmic models.
Regression Analysis with Statistics
Until now we have only talked about ordinary regression analysis. We will now introduce concepts from probability and statistics to make our models more powerful. Consider the regression models above and look at the deviations between the measurement points and the fitted lines. There is clearly some distance between the model and the points, so the model is not perfect. However, suppose we could model this deviation as a Gaussian random process. If the deviation around the model is Gaussian, or approximately Gaussian, we could use many powerful tools to say much more about the model.
If we statistically determine the variance of the deviation, we can make predictions about future values. We would then finally be able to answer questions such as: within what interval will a future value lie, with 95% probability, for a house size of X square meters? This can be achieved by modelling the deviation as a Gaussian random variable.
In this chapter we learned about the basic considerations in regression modelling. There are many available models to choose from, so it is up to the designer to pick one suitable for the problem at hand. It is also important to check whether the results make sense intuitively; otherwise strange results might occur.
Further resources for Statistical Modelling
Wikipedia has a good article on regression analysis. It may be a bit mathematical for some readers, but it still gives a good overview of the topic.
Regression calculations can be done by hand, but their true power is unlocked when software is used to perform them quickly and efficiently.