Big Data

Probability and Statistics in Big Data

Big data is a term whose popularity has increased dramatically in recent years. People place a lot of faith in big data to solve all kinds of problems, for example by using mass amounts of data to predict events and trends. This can be applied in medical applications such as diagnostics, as well as in finance, marketing, and product and service improvement.

Big data essentially means working with data sets that are much larger than what traditional methods can handle. However, big data has become a buzzword, and nowadays it can simply mean dealing with large data sets.

Challenges with Big Data

If you have a small data set, it is easy to implement machine learning algorithms or perform statistical analysis such as linear regression to help analyze and use the data in a useful manner. However, when data sets grow too large, it might not be possible to perform these tasks, due to the sheer amount of data. Memory errors can occur (RAM fills up, hard drives run out of space, etc.) or processing issues arise (it takes too long to process all the data). Hence, when you are dealing with mass amounts of data, you have to think harder and be smarter. You need an understanding of programming and must be able to estimate how long different operations will take to execute. It can also be helpful, instead of trying to process all the data at once, to divide the data into subsets, perhaps dealing with a bit of data each day and continuing to build upon this, as sketched in the example below.
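One common way to divide the work into subsets is chunked processing: reading a file piece by piece so that only one chunk is in memory at a time. The sketch below illustrates the idea with pandas; the file name "sensor_readings.csv" and its "temperature" column are hypothetical stand-ins for a real data set.

# A minimal sketch of chunked processing, assuming a hypothetical CSV file
# "sensor_readings.csv" with a numeric "temperature" column that is too
# large to fit in RAM all at once.
import pandas as pd

total = 0.0
count = 0

# Read the file in chunks of one million rows instead of loading everything.
for chunk in pd.read_csv("sensor_readings.csv", chunksize=1_000_000):
    total += chunk["temperature"].sum()
    count += len(chunk)

print(f"Mean temperature over {count} readings: {total / count:.2f}")

Because each chunk is discarded after its contribution to the running sum is recorded, memory use stays roughly constant no matter how large the file grows.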

How to use Statistics in Big Data

How can we use statistics in big data? What are the possibilities? One important topic is sampling. Sampling simply means that you “sample”, or measure, some parameter. Imagine, for example, that we have a temperature reading from a sensor every second, every day of the year. There are over 31 million seconds in a year, so we would have 31 million separate values saved together with a timestamp. The question is then: if we want to look at temperature trends relating to global warming, for example, do we really need to look at all these samples? Imagine further that you have several different sensor stations, each producing this amount of data every year; then it starts to become difficult to even handle all this data. This is where the statistical subtopic of sampling comes in. By studying sampling, it is easier to understand how many samples we need and what kind of results they will give us. Statisticians often give results in terms of something being “statistically significant”. This usually means that some kind of hypothesis test has been performed to check the reliability or validity of the results.
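To make the sensor example concrete, the sketch below simulates a year of per-second temperature readings and shows that a modest random sample recovers the overall mean closely, at a fraction of the cost. The seasonal trend and noise parameters are purely illustrative assumptions, not real sensor data.

# A minimal sketch of sampling, using synthetic data in place of a real
# sensor: one simulated temperature reading per second for a year
# (about 31.5 million values).
import numpy as np

rng = np.random.default_rng(seed=42)
seconds_per_year = 365 * 24 * 60 * 60  # 31,536,000 readings

# Simulate a year of readings: a mild seasonal trend plus random noise
# (the trend shape and noise level here are made-up illustration values).
t = np.arange(seconds_per_year)
temps = (10 + 15 * np.sin(2 * np.pi * t / seconds_per_year)
         + rng.normal(0, 3, seconds_per_year))

# Instead of using all 31.5 million values, draw a random sample of 10,000.
sample = rng.choice(temps, size=10_000, replace=False)

print(f"Full mean:   {temps.mean():.3f}")
print(f"Sample mean: {sample.mean():.3f}")

# A 95% confidence interval for the mean, computed from the sample alone,
# quantifies how much trust the sample estimate deserves.
se = sample.std(ddof=1) / np.sqrt(len(sample))
print(f"95% CI: {sample.mean() - 1.96 * se:.3f} to {sample.mean() + 1.96 * se:.3f}")

The confidence interval is where sampling theory pays off: it tells us how many samples are enough for a given precision, which is exactly the question raised above about the 31 million readings.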